IIIT Hyderabad Publications
Harnessing Morphological Regularities for Representation Learning for Low Resourced Languages

Author: Arihant Gupta
Date: 2018-01-06
Report no: IIIT/TH/2018/1
Advisor: Manish Shrivastava

Abstract

Morphology is the branch of linguistics that deals with words: their internal structure, how they are formed, and their relationship to other words in the same language. It involves analyzing the structure of words and parts of words, such as stems, root words, prefixes, and suffixes. It also looks at parts of speech, intonation and stress, and the ways context can change a word's pronunciation and meaning. In most languages, if not all, many words can be related to other words by rules that collectively describe the grammar of that language. For example, English speakers recognize that the words dog and dogs are closely related, differentiated only by the plurality morpheme "-s", which is found bound only to nouns.

With recent advancements in computational linguistics, we can now learn a distributed vector representation for each word, also called a word representation, from a monolingual corpus of a language (the training corpus). Word representations have been shown to contain syntactic as well as semantic (morphological) regularities, and are widely used across natural language processing, including, but not limited to, dependency parsing and named entity recognition.

One major requirement for learning good word representations (word embeddings) is a sufficiently large training corpus: the size of the training corpus directly affects the quality of the word representations a model learns. Many languages, even though widely spoken, are computationally resource poor, which results in poorer trained word embeddings. On top of this, morphologically rich languages suffer from morphologically induced data sparsity, since one morphological form of a word may be common while another is rare in the same training corpus.

Hence, to learn better word representations for low resourced languages and to better exploit the morphological regularities present in distributional word representations, we present a language independent, unsupervised method for building word embeddings through morphological expansion of text, exploiting the morphological regularities present in distributed word representations. Our model handles the problem of data sparsity and yields improved word embeddings by training them on artificially generated sentences. We evaluate our method using small training sets on eleven test sets for the word similarity task across seven languages. Further, for English, we evaluate the impact of our approach using a large training set on three standard test sets. Our method improved results across all languages.

We also present an unsupervised, language agnostic approach for exploiting the morphological regularities present in high dimensional vector spaces. We propose a novel method for generating embeddings of words from their morphological variants using morphological transformation operators. We evaluate this approach on the MSR word analogy test set with an accuracy of 85%, which is 12% higher than the previous best known system.
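To illustrate the kind of morphological regularity the thesis builds on, the sketch below applies the vector offset between dog and dogs as a crude plurality operator. This is a minimal illustration assuming a pretrained word2vec-format model (the file name vectors.bin is a placeholder), not the thesis's actual transformation operators.

```python
# Minimal sketch: morphological regularities as vector offsets,
# using gensim's KeyedVectors. "vectors.bin" is a hypothetical path
# to any pretrained word2vec-format embedding file.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# The plurality morpheme "-s" corresponds approximately to a constant
# direction in the embedding space.
plural_offset = kv["dogs"] - kv["dog"]

# Applying that offset to another singular noun should land near its
# plural form (e.g., "cats" for "cat").
candidate = kv["cat"] + plural_offset
print(kv.similar_by_vector(candidate, topn=5))
```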
Full thesis: pdf

Centre for Language Technologies Research Centre