Leveraging SOA Principles in Development of Modular & Scalable NLP Pipelines

Author: Nehal Jagdish Wani
Date: 2019-05-14
Report no: IIIT/TH/2019/52
Advisor: Dipti Misra Sharma, Suresh Purini

Abstract

Installing, deploying and maintaining an NLP system can be a daunting task, depending on the number and complexity of the components involved. One such system is a hybrid Machine Translation system, composed of several modules that define the transformation of a given word, phrase or sentence from one language to another. The end users of such a system can be developers themselves who want to improve it. To support them, the system as a whole needs to adopt an architecture that lets users control and change the order of components in the pipeline, interrupt execution midway to debug one or more components, tweak inputs and outputs without rewriting the components, replicate the system on their local machine without the hassle of compiling everything from source, and replace a component with a higher or lower version and quickly see the impact on the final result; in short, an architecture that makes development iterative and fast. The system should also expose an interface for users who want to build something on top of it without having to worry about the internal details. The ideas proposed in this thesis try to cater to the needs of a broad category of users, in an attempt to keep their workflow as simple as possible. We propose an architecture in which we show how to identify and transform a monolithic system into small, individual components (each being a linguistic unit), identify bottlenecks from an operating system's point of view, identify scalable components and finally provide an easy mechanism to interact with the system. To demonstrate this, we apply the proposed architecture to an existing system (the Sampark MT System) and walk through its transformation. Towards the end, we present a web client that shows how easy it becomes to interact with the modified system.
We also show that the proposal can be applied to any pipeline-based system by thinking of it as a disconnected, directed acyclic graph. We further show how the modified system can be deployed on the cloud easily and how individual components can be scaled up or down as needed. To plan the overall architecture and produce guidelines for enabling large-scale collaborative development, a structured systems analysis and design, from the point of view of both a computational linguist and a systems engineer, is required. This thesis lays the foundation in that direction by enhancing an existing system, reducing the overall runtime of its components by more than 85%, improving the test-dev-deploy cycle for computational linguists and discussing a generalized architecture on top of which more complex systems can be built for specific purposes.
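
As an illustration only, the idea of treating a pipeline as a directed acyclic graph can be sketched in a few lines of Python. This is a minimal sketch and not the architecture implemented in the thesis: the stage names, the toy stage functions and the run_pipeline helper are all hypothetical, and real components would be separate services rather than in-process functions. It assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical toy stages; these are NOT the Sampark MT modules.
def tokenize(text):
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def join(tokens):
    return " ".join(tokens)

# Stage name -> callable. Replacing a component means replacing one entry.
STAGES = {"tokenize": tokenize, "lowercase": lowercase, "join": join}

# Stage name -> ordered list of stages whose outputs it consumes.
# Changing the pipeline order means editing this graph, not the stages.
DEPENDS_ON = {"tokenize": [], "lowercase": ["tokenize"], "join": ["lowercase"]}

def run_pipeline(text, stages, depends_on, trace=None):
    """Run every stage in a topological order of the dependency graph.

    A stage with no dependencies receives the raw input; any other stage
    receives the outputs of its dependencies as positional arguments.
    If a list is passed as `trace`, each intermediate result is appended
    to it, which allows inspecting the pipeline mid-execution.
    """
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        deps = depends_on.get(name, [])
        args = [results[d] for d in deps] if deps else [text]
        results[name] = stages[name](*args)
        if trace is not None:
            trace.append((name, results[name]))
    return results

trace = []
results = run_pipeline("Hello World", STAGES, DEPENDS_ON, trace)

# Swap one component without touching the others.
swapped = dict(STAGES, lowercase=lambda tokens: [t.upper() for t in tokens])
results_swapped = run_pipeline("Hello World", swapped, DEPENDS_ON)
```

Keeping the graph separate from the stage implementations is what makes reordering, swapping and mid-pipeline inspection cheap in this sketch: each of those operations changes a dictionary entry or reads the trace, and no stage has to be rewritten.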

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
