IIIT Hyderabad Publications
A Corpus Factory for many languages

Authors: Adam Kilgarriff, Siva Reddy, Jan Pomikálek, Avinesh PVS
Conference: Seventh conference on International Language Resources and Evaluation (LREC'10), European Language Resources Association (ELRA) (LREC10 2010)
Date: 2010-05-17
Report no: IIIT/TR/2010/67

Abstract

For many languages there are no large, general-language corpora available. Until the web, all but the richest institutions could do little but shake their heads in dismay as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a 'corpus factory' where we build large corpora. In this paper we describe the method we use, and how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. The corpora we have developed are available for use in the Sketch Engine corpus query tool.

Full paper: pdf
Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.