Efficient Creation of Highly Compressed Inverted Index on a Single Node

Authors: Ashutosh Kumar
Conference: EEE International Conference on High Performance Computing (HiPC) Students Research Symposium 2013

Date: 2013-12-18
Report no: IIIT/TR/2013/47

Abstract

In this paper, we have tried to enhance the creation of inverted index by applying hashing and compression of data. We realized that hashing words to a number greatly reduces the runtime, as comparing numbers is inherently faster than strings. Hashing does not lead to observable collisions, as for the chosen hash function, the size of range is more than the square of number of English words. We also reasserted that using compression actually improves the runtime, as the process of indexing in IO bound, and we can utilize unused CPU cycles for reducing IO. As indexing is data parallel, we implemented a multi threaded indexer that independently indexes chunks of data for effective utilization of the available cores. With the fine tuned parallel and scalable implementation we were able to index 13 million pages of Wikipedia (41 GB XML) in about 9 minutes and the resultant index had a size of 1.9 GB achieving 95% compression. The index allows searching based on keywords in title, body, link or category of the page, and had an average response time of 10 ms. We compared our results with Apache Lucene, the indexer that Wikipedia uses, and noticed that it only achieved a compression of about 75% in twice the time. Other than indexing benefits, our solution had an implicit advantage in query optimization. Because of the small size, a much larger proportion of the final index could fit in main memory and queries could be answered rapidly without much involvement of the disk

Full paper: pdf

Centre for Security, Theory and Algorithms

IIIT Hyderabad Publications

Efficient Creation of Highly Compressed Inverted Index on a Single Node

Abstract