Beyond the Surface: A Computational Exploration of Linguistic Ambiguity

Author: Anmol Goel 2021701045
Date: 2023-06-27
Report no: IIIT/TH/2023/84
Advisor: Ponnurangam Kumaraguru

Abstract

Ambiguity in natural language poses a significant challenge to computational linguistics and natural language processing. It arises when words or phrases can have multiple meanings, depending on the context in which they are used. Addressing ambiguity is crucial for building more accurate and effective language models that better reflect the complexity of human communication. In this thesis, we investigate two specific forms of linguistic ambiguity: polysemy, the multiplicity of meanings for a specific word, and tautologies, seemingly uninformative and ambiguous phrases used in conversation. Both phenomena are widely known manifestations of linguistic ambiguity, at the lexical and pragmatic levels, respectively. The first part of the thesis proposes a new method for quantifying the degree of polysemy of a word, that is, the number of distinct meanings the word can have. The proposed approach is a novel, unsupervised framework for estimating polysemy scores for words in multiple languages, infusing syntactic knowledge in the form of dependency structures. The framework adopts a graph-based approach, computing the discrete Ollivier-Ricci curvature on a graph of contextual nearest neighbors. Its effectiveness is demonstrated by significant correlations between the resulting scores and expert human-annotated language resources such as WordNet. The framework is tested on curated datasets that control for different sense distributions of words in three typologically diverse languages: English, French, and Spanish. By leveraging contextual language models and syntactic structures, it empirically supports the widely held theoretical linguistic notion that syntax is intricately linked to ambiguity and polysemy.
The second part of the thesis explores how language models handle colloquial tautologies, a type of redundancy commonly used in conversational speech. Colloquial tautologies pose an additional challenge to language processing, as they involve the repetition of words or phrases that may appear redundant, but convey a specific meaning in a given context. We first present a dataset of colloquial tautologies and evaluate several state-of-the-art language models on this dataset using perplexity scores. We conduct probing experiments while controlling for the noun type, context and form of tautologies. The results reveal that BERT and GPT2 perform better with modal forms and human nouns, which aligns with previous literature and human intuition. We hope this work bolsters further research on ambiguity in language models. Our contributions have important implications for the development of more accurate and reliable natural language processing systems.
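
As an illustration of the perplexity scores mentioned above, the following is a minimal sketch of how perplexity can be computed from per-token log-probabilities. It is not the evaluation code used in the thesis: the function name `perplexity` and the example probabilities are hypothetical, and the sketch assumes that a language model has already assigned a natural-log probability to each token of a sentence. For a masked model such as BERT, a pseudo-log-likelihood would typically stand in for these values.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one sequence from its per-token natural-log probabilities.

    Computed as exp of the average negative log-likelihood per token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical values for illustration only (not taken from the thesis):
# a model that gives every token probability 0.25, and a more confident
# model that gives every token probability 0.5.
uncertain = [math.log(0.25)] * 4
confident = [math.log(0.5)] * 4

print(perplexity(uncertain))
print(perplexity(confident))
```

Lower perplexity means the model finds the sequence less surprising, so under this measure a tautology form that a model handles better would receive a lower score.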

Full thesis: pdf

Centre for C2S2-Precog

IIIT Hyderabad Publications
