Understanding People in Low Resourced Languages

Author: Sahil Swami
Date: 2018-11-17
Report no: IIIT/TH/2018/94
Advisor: Manish Shrivastava

Abstract

Social media platforms like Twitter and Facebook have become two of the largest platforms for people to communicate and share their views. The casual and informal environment on these platforms leads more people to express themselves in their native languages, producing a growing amount of code-mixed data for which annotated datasets are currently lacking. With access to public opinion on nearly every topic, we can gather a huge amount of user data that could prove useful to various companies, which makes tasks like opinion mining and sentiment analysis even more important. Hence, understanding users in low-resourced languages has become one of the most researched tasks of late. We present two English-Hindi code-mixed datasets and build baseline classification systems to evaluate them. Because creating the datasets takes time, we decided to first test our classification system on another dataset of Spanish and Catalan tweets on Catalan Independence, as Catalan is a low-resourced language from the perspective of Natural Language Processing. Thus, we first present a supervised classification system for stance and gender detection in Spanish and Catalan tweets on Catalan Independence. Then we present two English-Hindi code-mixed corpora, one for stance detection and the other for sarcasm detection in code-mixed tweets. The tweets for stance detection are collected for the target ‘Demonetisation’, whereas the tweets for sarcasm detection are collected on various topics such as cricket, Bollywood, and politics. Each tweet is annotated for stance or for the presence of sarcasm, according to its dataset, and each token in the tweets is annotated with a language tag. Finally, we present a classification system developed using these datasets for stance and sarcasm detection. This system uses various word- and character-level features along with three different classification techniques, and it is evaluated using 10-fold cross-validation.
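
As a rough illustration of the feature and evaluation setup the abstract mentions, the following minimal sketch extracts character and word n-gram counts from a tweet and produces 10-fold train/test index splits. The function names, the n-gram sizes, and the example tweet are assumptions made here for illustration; the abstract does not specify the exact features or name the three classifiers, so no classifier is included.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def k_fold_indices(num_items, k=10):
    """Yield (train, test) index lists so that each item is held out exactly once."""
    indices = list(range(num_items))
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, starting at offset `fold`
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# Hypothetical code-mixed tweet, not taken from the thesis datasets.
tweet = "yeh decision bahut accha hai"

# Prefix keys so character and word features cannot collide.
features = {}
for gram, count in char_ngrams(tweet, 3).items():
    features["char:" + gram] = count
for gram, count in word_ngrams(tweet, 1).items():
    features["word:" + gram] = count

# Ten train/test splits over a hypothetical corpus of 25 tweets.
folds = list(k_fold_indices(25, k=10))
```

In a full pipeline, feature dictionaries like these would be vectorized and passed to each classifier, with scores averaged over the ten folds.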

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
