Deeper Analysis and Comprehension of Documents including Contracts

Author: Hiranmai Sri Adibhatla 2018900044
Date: 2023-07-04
Report no: IIIT/TH/2023/118
Advisor: Manish Shrivastava

Abstract

Humans exchange information through speech and text, collectively known as natural language. Each day, people communicate with each other in various languages through speech or text, sharing a vast amount of information. The data produced through natural language communication can offer valuable insights despite being ambiguous, unstructured, and noisy. However, computers cannot yet interpret natural language on their own. To fully comprehend this data and respond intelligently, computers need the ability to understand and emulate human language. This is where Natural Language Processing (NLP) comes in: it is a branch of Artificial Intelligence that focuses on enabling machines to read, comprehend, and derive meaning from human languages. NLP integrates the disciplines of linguistics and computer science. It studies the structure and rules of language and builds models capable of comprehending, analyzing, and extracting important information from both text and speech. An abundance of information is available in the form of text, including books, documents, articles, social media posts, and more. A document, one of the oldest forms of information exchange, is written, printed, or electronic material created to convey information from its author to its intended audience. Documents contain valuable information that can significantly benefit business activities, and NLP applications can extract insights from this text data. Enterprises use NLP applications for purposes ranging from document understanding and information extraction to answering common questions. In this thesis, we develop techniques for deeper analysis and understanding of documents that are commonly used in enterprises.
A contract is a frequently used type of document in the corporate world. Contracts are agreements between two or more parties that govern what each party can or cannot do, and they are usually dense in information. Automatically extracting key components, or components that contain rare or novel information, from these large documents makes reviewing contracts easier. Nevertheless, this is a challenging task because the key and novel components are not present in isolation within the contract. Extraction of significant components (key components + novel components) from contracts aims to simplify the end user’s comprehension and reduce dependency on legal experts for reviewing contracts. In this thesis, we introduce approaches for the automatic identification and extraction of significant components from a contract. We propose a Bidirectional Encoder Representations from Transformers (BERT) based model that automatically identifies or highlights significant components of a contract.
Reports are another frequently encountered type of document in the corporate world. A report is a document that provides information and analysis on a particular topic or issue. Reports are used to convey important information to stakeholders, such as managers, executives, investors, and customers. The vast data available in these reports has the potential to revolutionize data-driven analysis. Causality identification and span detection is one such data-driven task. A cause-and-effect relationship holds between two events when one causes the other to happen. We explored various transformer-based models that classify sentences and identify the relevant spans within a sentence.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
