IIIT Hyderabad Publications
Some Observations on Corpora of Some Indian Languages

Authors: Akshar Bharati, Sushama Bendre, Rajeev Sangal
Conference: Published in Knowledge-Based Computer Systems, Tata McGraw-Hill, Dec. 1998.
Date: 1998-12-31
Report no: IIIT/TR/1998/1

Abstract

This paper presents data from seven Indian languages, extracted from machine-readable corpora. Two kinds of data are extracted: (1) the number of distinct high-frequency words in a language needed to achieve a designated level of coverage of a random text, and (2) the percentage of common words occurring in different languages. (A word is defined as a sequence of alphabetic characters delimited by spaces and punctuation marks.) The first data set follows the expected trend: south Indian languages have a much larger number of distinct words because they are more inflected than the north Indian languages and are agglutinative in nature. Telugu turns out to be more so than Kannada. The second data set shows that languages in adjacent regions have many more common words, as expected. The aim of the paper is to introduce scholars to the possibilities of using computers to extract useful data about languages from machine-readable corpora. The data presented here has been used to build software for detecting the language of an unknown text.

Centre: Language Technologies Research Centre
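The two measurements described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tokenize`, `words_needed_for_coverage`, `percent_common_words`) are hypothetical, and two choices are assumptions made here. First, Unicode combining marks are treated as part of a word so that vowel signs in Indic scripts do not split it. Second, the percentage of common words is taken relative to the distinct words of the first language, since the abstract does not give the exact formula.

```python
from collections import Counter
import unicodedata


def tokenize(text):
    """Split text into words: runs of letters and combining marks."""
    words, current = [], []
    for ch in text:
        # Assumption of this sketch: general categories L* (letters) and
        # M* (combining marks, e.g. Indic vowel signs) belong to a word;
        # everything else (spaces, punctuation, digits) is a delimiter.
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def words_needed_for_coverage(tokens, target):
    """Count the most frequent distinct words needed so that they account
    for at least `target` (a fraction in [0, 1]) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


def percent_common_words(tokens_a, tokens_b):
    """Percentage of distinct words in A that also occur in B
    (one possible definition; assumed, not taken from the paper)."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    if not vocab_a:
        return 0.0
    return 100.0 * len(vocab_a & vocab_b) / len(vocab_a)
```

Under these assumptions, a more inflected or agglutinative language would be expected to return a larger value from `words_needed_for_coverage` at the same `target`, which is the trend the abstract reports for the south Indian languages.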