LLM Driven Web Profile Extraction for Identical Names & Connected Entities in Interlocking Directorships

Author: Prateek Sancheti 2019111041
Date: 2024-06-26
Report no: IIIT/TH/2024/118
Advisor: Kavita Vemuri, Kamalakar Karlapalem

Abstract

The number of individuals with identical names on the internet is increasing, which makes searching for a specific individual tedious. A user must sift through many profiles with identical names to reach the individual of interest. An individual's online presence forms that individual's profile. We need a solution that consolidates the profiles of such individuals by retrieving factual information available on the web and presenting it as a single result. We present a novel solution that retrieves web profiles of people bearing identical full names through an end-to-end pipeline. Our solution involves information retrieval from the web (extraction), LLM-driven Named Entity Extraction (retrieval), and standardization of facts using Wikipedia, and it returns profiles with fourteen multi-valued attributes. Profiles that correspond to the same real-world individual are then determined. We accomplish this by identifying similarities among profiles based on the extracted facts using a Prefix Tree inspired data structure (validation) and by utilizing ChatGPT's contextual comprehension (revalidation). The system offers three levels of strictness when consolidating these profiles: strict, relaxed, and loose matching. The novelty of our solution lies in the innovative use of GPT, a highly powerful yet unpredictable tool, for such a nuanced task. A study involving twenty participants, together with other results, found that one could effectively authenticate information for a specific individual.

Interlocking Directorships (IDs) have been an area of interest for researchers for several decades. Corporations sharing directors and directors sharing corporations create connections in the corporate world. These connections carry double-edged influences, from resource sharing and network expansion to collusion and quid pro quo.
We present a systematic approach to identify frequently occurring groups of directors and companies, as well as connected components, within these corporate structures. We identify various weakly (maximal cliques) and strongly (maximal frequent itemsets) connected entities in these networks by extracting and analyzing a data corpus of over 55,000 directors, 85,000 companies, and over 300,000 director-company links. We also found that 37,123 companies out of a total of 87,187, almost 30%, have at least one pair of directors who share the same last name (possibly family-run companies). Finally, we present a way to extract personal and professional relations between connected directors from the web with our LLM-driven profile extraction pipeline.
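
As a rough illustration of the network analysis summarized above, the following Python sketch enumerates maximal cliques among directors and flags companies with a shared last name on the board. It is a minimal sketch under stated assumptions, not the thesis implementation: the sample links and all function names are hypothetical, two directors are assumed to be connected when they sit on at least one common board, and the last whitespace-separated token of a name is assumed to be the last name.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (director, company) links; names are illustrative only.
links = [
    ("A. Rao", "C1"), ("B. Rao", "C1"), ("C. Mehta", "C1"),
    ("A. Rao", "C2"), ("C. Mehta", "C2"),
    ("D. Iyer", "C3"), ("E. Shah", "C3"),
]

def boards(links):
    """Map each company to the set of directors on its board."""
    by_company = defaultdict(set)
    for director, company in links:
        by_company[company].add(director)
    return by_company

def director_graph(by_company):
    """Assumption: two directors are linked if they share at least one board."""
    adj = defaultdict(set)
    for members in by_company.values():
        for m in members:
            adj[m]  # make sure isolated directors still appear as nodes
        for a, b in combinations(sorted(members), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

def shares_last_name(members):
    """Assumption: the last whitespace-separated token is the last name."""
    last_names = [m.split()[-1].lower() for m in members]
    return len(last_names) != len(set(last_names))

by_company = boards(links)
cliques = maximal_cliques(director_graph(by_company))
family_like = sorted(c for c, m in by_company.items() if shares_last_name(m))
```

On this toy data, `cliques` holds the groups of directors who are all pairwise connected, and `family_like` lists the companies with at least one pair of directors sharing a last name. At the scale of the corpus described above, a library routine for clique enumeration would likely be a better choice than this plain recursion.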

Full thesis: pdf

Centre for Data Engineering

IIIT Hyderabad Publications
