Automatic Error Detection in Annotated Corpora

Author: Narendra Annamaneni
Date: 2017-07-21
Report no: IIIT/TH/2017/64
Advisor: Dipti Misra Sharma

Abstract

An annotated corpus is a linguistic resource that explicitly encodes syntactic and semantic information for each sentence. Annotated corpora play a crucial role in many applications of natural language processing (NLP), and these applications depend on the annotations being consistent and free of errors. Creating annotated corpora is expensive and time consuming. Errors and anomalies creep in through human mistakes and, sometimes, through differing interpretations of the annotation guidelines. Maintaining annotation quality is therefore a challenging problem: validating a corpus and correcting its errors manually requires the time of experts in both the language and the annotation guidelines, which is costly. Hence, these experts need intelligent tools that automatically detect likely errors in annotated corpora, which they can then validate quickly. Treebank annotation involves encoding information at the POS, morph, chunk and dependency levels, and it requires a domain-specific understanding of the language and of the dependency guidelines.

In this work, we address the problem of treebank validation and propose novel approaches to detect errors automatically. Specifically, we address errors at the dependency level, which is the most error-prone part of the annotation process because of the complex rules in the dependency annotation schema. Our solution applies ensembling methods to the outputs of dependency parsers. We hypothesize that annotation and validation should proceed in parallel rather than waiting for the entire corpus to be created. Our tool presents annotators with error instances or inconsistent cases, so that they can resolve ambiguities in their understanding by reflecting on this small number of instances. This helps annotators understand early the errors being committed in the annotation process. We also address the problem of skewed data sets, which is common in Indian languages, by using word embeddings. Finally, we attempt to build tools that correct dependency errors automatically. Our work mainly investigates error detection using dependency parsers and detects errors with F-scores of 71.18% and 42.19% on the available Hindi and Telugu treebanks, respectively. It also includes preliminary attempts to correct errors automatically, which raised the baseline precision of the corpus from 88.59% to 92.29% for the Hindi treebank.
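The abstract does not specify how the parser outputs are ensembled. As a minimal sketch only, assuming simple majority voting (an assumption, not necessarily the thesis's method), a token could be flagged for expert review when several parsers agree with each other on a dependency arc that differs from the annotated one. The function names, the `min_votes` threshold, the dependency labels and the toy data below are all hypothetical.

```python
from collections import Counter


def flag_suspect_tokens(gold, parser_outputs, min_votes=2):
    """Return indices of tokens whose annotated (head, label) arc is outvoted.

    A token is flagged when some arc other than the annotated one is proposed
    by at least `min_votes` parsers and by more parsers than support the
    annotated arc. Majority voting is an illustrative assumption.
    """
    suspects = []
    for i, gold_arc in enumerate(gold):
        votes = Counter(output[i] for output in parser_outputs)
        for arc, count in votes.items():
            if arc != gold_arc and count >= min_votes and count > votes[gold_arc]:
                suspects.append(i)
                break
    return suspects


def f_score(flagged, true_errors):
    """Harmonic mean of precision and recall of the flagged token indices."""
    flagged, true_errors = set(flagged), set(true_errors)
    true_positives = len(flagged & true_errors)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(true_errors)
    return 2 * precision * recall / (precision + recall)


# Toy sentence of four tokens; each arc is (head index, dependency label).
# Labels and values are invented for illustration.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "obj")]
parser_a = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "obl")]
parser_b = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "obl")]
parser_c = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "obj")]

suspects = flag_suspect_tokens(gold, [parser_a, parser_b, parser_c])
print(suspects)
```

In this toy case only the last token is flagged, because two parsers agree on an arc that contradicts the annotation, while the lone disagreement on the third token falls below the threshold. Flagged tokens are candidates for an expert to check, not confirmed errors.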

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
