Basic Statistical Analysis of Corpus and Cross Comparison

Authors: Akshar Bharati, Prakash Rao, Rajeev Sangal, S M Bendre
Conference: Published in the proceedings of ICON-2002: International Conference on Natural Language Processing, Mumbai, 18-21 Dec 2002

Date: 2002-12-18
Report no: IIIT/TR/2002/9

Abstract

This study presents a statistical analysis of machine-readable corpora of ten Indian languages. The analysis uses basic statistics: unigram frequencies, bigram frequencies, syllable frequencies, word length distribution and sentence length distribution. The following information is extracted from each corpus: (a) the frequency of words and their percentages in the whole corpus; (b) the number of distinct words required to cover a certain percentage of the corpus; (c) syllable frequencies and patterns extracted from syllables; (d) the entropy of words in the corpus; (e) word length analysis using average and modal word length; (f) sentence length analysis using average and modal sentence length. In addition, a comparative analysis based on the same statistics is carried out on Hindi corpora collected from different sources. The study shows how these basic statistics can be used to find similarities and differences among Indian languages. It is intended to extend the analysis to compare the grammars of the languages using morphological analyzers.
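As a rough illustration of the statistics listed in the abstract, the following Python sketch computes word percentages, the number of distinct words needed to reach a given coverage, unigram entropy, and average and modal word length. It is not the authors' implementation: the function name `corpus_statistics`, the `coverage` parameter, whitespace tokenization, and measuring word length in Unicode code points are all assumptions made for this example.

```python
import math
from collections import Counter


def corpus_statistics(text, coverage=0.5):
    """Basic corpus statistics for a whitespace-tokenized text (illustrative only)."""
    words = text.split()  # assumption: words are separated by whitespace
    if not words:
        raise ValueError("text contains no words")
    total = len(words)
    counts = Counter(words)

    # (a) share of each word in the corpus, in percent
    percentages = {w: 100.0 * c / total for w, c in counts.items()}

    # (b) number of distinct words, taken from most to least frequent,
    #     needed to cover the requested fraction of all word tokens
    covered = 0
    words_needed = 0
    for _, c in counts.most_common():
        covered += c
        words_needed += 1
        if covered >= coverage * total:
            break

    # (d) unigram entropy in bits: -sum(p * log2(p))
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

    # (e) average and modal word length
    #     assumption: length is counted in Unicode code points, not syllables
    lengths = [len(w) for w in words]
    average_length = sum(lengths) / total
    modal_length = Counter(lengths).most_common(1)[0][0]

    return {
        "percentages": percentages,
        "words_needed": words_needed,
        "entropy": entropy,
        "average_word_length": average_length,
        "modal_word_length": modal_length,
    }
```

Calling `corpus_statistics(open(path, encoding="utf-8").read(), coverage=0.8)` on two corpora of the same language and comparing the returned values would give a simple cross comparison in the spirit of the abstract; sentence length statistics could be computed the same way after splitting the text on sentence boundaries.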

Full paper: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
