IIIT Hyderabad Publications
Deep Learning for Detecting Inappropriate Content in Text

Author: Harish Yenala
Date: 2018-03-21
Report no: IIIT/TH/2018/13
Advisors: Manish Shrivastava, Manoj Chinnakotla

Abstract

A given piece of textual information produced by any user or agent is said to be inappropriate if the expressed intent may cause anger or annoyance to certain users, exhibits a lack of respect, rudeness, or discourteousness towards certain individuals or communities, or is capable of inflicting harm on oneself or others. A search engine should regulate its query completion suggestions by detecting and filtering such queries, as they may hurt user sentiments or lead to legal issues, thereby tarnishing the brand image. Hence, automatic detection and pruning of such inappropriate queries from completions and related search suggestions is an important problem for most commercial search engines. The problem is rendered difficult by the unique challenges posed by search queries, such as the lack of sufficient context, natural language ambiguity, and the presence of spelling mistakes and variations.

In this thesis, we propose a novel deep learning based technique for automatically identifying inappropriate query suggestions. We propose a novel deep learning architecture called "Convolutional Bi-Directional LSTM (C-BiLSTM)", which combines the strengths of both Convolutional Neural Networks (CNN) and Bi-directional LSTMs (BLSTM). Given a query, C-BiLSTM uses a convolutional layer to extract feature representations for each query word, which are then fed as input to the BLSTM layer; the BLSTM captures the various sequential patterns in the entire query and outputs a richer representation encoding them. The query representation thus learnt is passed through a deep fully connected network which predicts the target class. C-BiLSTM does not rely on hand-crafted features, is trained end-to-end as a single model, and effectively captures both local features and their global semantics.
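The convolution-then-BLSTM-then-fully-connected pipeline described above can be sketched in PyTorch. This is a minimal illustrative sketch, not the thesis's actual model: the vocabulary size, embedding dimension, filter count, hidden size, and pooling choice are all assumptions made for the example.

```python
# Illustrative sketch of a C-BiLSTM-style classifier: a convolutional layer
# over word embeddings, a bi-directional LSTM over the resulting features,
# and a deep fully connected network for the final prediction.
# All hyperparameters below are assumptions, not the thesis's settings.
import torch
import torch.nn as nn

class CBiLSTM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, conv_filters=32,
                 lstm_hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Convolutional layer: extracts a local feature representation
        # for each query word position (window of 3 words here).
        self.conv = nn.Conv1d(emb_dim, conv_filters, kernel_size=3, padding=1)
        # Bi-directional LSTM: captures sequential patterns over the
        # whole query in both directions.
        self.bilstm = nn.LSTM(conv_filters, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Deep fully connected network: predicts the target class
        # (appropriate vs. inappropriate) from the query representation.
        self.fc = nn.Sequential(
            nn.Linear(2 * lstm_hidden, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, emb_dim)
        x = self.conv(x.transpose(1, 2))          # (batch, filters, seq_len)
        x = torch.relu(x).transpose(1, 2)         # (batch, seq_len, filters)
        states, _ = self.bilstm(x)                # (batch, seq_len, 2*hidden)
        # Use the last time step's concatenated forward/backward states
        # as the query representation (one simple pooling choice).
        return self.fc(states[:, -1, :])          # (batch, num_classes)

model = CBiLSTM()
logits = model(torch.randint(0, 10000, (4, 12)))  # 4 queries, 12 tokens each
print(logits.shape)  # torch.Size([4, 2])
```

Because the whole pipeline is a single `nn.Module`, it trains end-to-end with an ordinary cross-entropy loss, with no hand-crafted features required.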
Evaluating C-BiLSTM on real-world search queries from a commercial search engine reveals that it significantly outperforms both pattern based and other hand-crafted feature based baselines. Moreover, C-BiLSTM also performs better than individual CNN, LSTM and BLSTM models trained for the same task.

The rapid growth of chatbots and interactive gaming systems creates demand for large volumes of real-time, clean (non-inappropriate) human conversation data to train their models. The popularity of social networking sites provides a platform for users to interact and share opinions publicly. Inappropriate conversations and posts on these platforms may also lead to loss of business and damage to a company's image. We therefore extended our idea to identifying inappropriate conversations in chat data. We applied various techniques to this problem and observed that BLSTM performs better than LSTM and Boosted Decision Tree approaches.

Full thesis: pdf

Centre for Language Technologies Research Centre
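For the chat-data experiments, the best-performing BLSTM approach can be sketched as a standalone classifier: embeddings feed a bi-directional LSTM whose states are pooled and classified. Again, this is a hedged sketch; the dimensions and mean-pooling strategy are illustrative assumptions, not the thesis's reported configuration.

```python
# Illustrative BLSTM classifier for labeling a chat utterance as
# appropriate or inappropriate. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64,
                 hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bi-directional LSTM reads the utterance left-to-right and
        # right-to-left, concatenating the two hidden states.
        self.blstm = nn.LSTM(emb_dim, hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, emb_dim)
        states, _ = self.blstm(x)                 # (batch, seq_len, 2*hidden)
        # Mean-pool the hidden states over time before classifying.
        return self.out(states.mean(dim=1))       # (batch, num_classes)

clf = BLSTMClassifier()
scores = clf(torch.randint(0, 10000, (8, 20)))    # 8 utterances, 20 tokens
print(scores.shape)  # torch.Size([8, 2])
```

Unlike a unidirectional LSTM, the backward pass lets the representation of each word condition on the words that follow it, which is one plausible reason the BLSTM outperformed the LSTM and Boosted Decision Tree baselines on this task.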
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.