Efficient locality sensitive hashing: solutions, primitives, and applications

Author(s)
Meng, Jingfan
Organizational Unit
School of Computer Science
Abstract
Approximate nearest neighbor search (ANNS) is a fundamental algorithmic problem arising in many areas of computer science. Locality-sensitive hashing (LSH) is a longstanding and principled solution scheme for ANNS. In this dissertation, we aim to design new LSH-based ANNS solutions, to improve the performance of existing ones by designing new algorithmic primitives for them, and to explore new applications of LSH-related techniques. Our main contributions are threefold. First, we design new and more capable LSH-based solution approaches to ANNS in two challenging metrics, namely after-linear-transform (ALT) and point-to-subspace (P2S) distance in L_p (1 <= p < 2). Over current baselines, our solutions improve indexing speed by up to 1286 times and query efficiency by up to 54.5 times. Second, we design new, efficient algorithmic primitives for LSH. We solve four open problems: efficient range-summability (ERS) for 1D random variables of arbitrary distributions, ERS in 2D and higher dimensions, approximately answering range-sum queries over a data cube, and fast computation of quadratic forms of a Gaussian orthogonal ensemble (FGoeQF). Our primitives considerably reduce the time complexity of all four tasks. For example, in FGoeQF, we reduce the time complexity from O(d^2) to O(d log d), where d is the dimension of the dataset and can be up to 4096 in real-world data. Third, we explore new applications of LSH-related techniques to massive, high-dimensional datasets. We propose two new applications. The first is CanDE, a lightweight add-on to an LSH index for various insightful data analytics in the neighborhood of the query. The second is CommonSense, the first communication-efficient protocol for computing the intersection between two massive sets stored on different hosts. Based on compressed sensing (a randomized algorithm closely related to LSH), CommonSense reduces the communication cost of many applications by up to an order of magnitude, on a wide range of dataset sizes from one million to more than 290 million.
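For readers unfamiliar with the LSH principle the abstract builds on, the following is a minimal sketch of the classic textbook p-stable hashing scheme for L_1 distance (the p = 1 case of the L_p range mentioned above). It is not one of the dissertation's methods. The function names, the bucket width, and the test data are all hypothetical choices made for illustration; the only point is that nearby points share hash buckets far more often than distant ones.

```python
import numpy as np

def make_l1_lsh(dim, num_hashes, bucket_width, seed=0):
    """Build num_hashes independent p-stable hash functions for L_1 distance.

    Hypothetical helper for illustration only (not from the dissertation).
    """
    rng = np.random.default_rng(seed)
    # Standard Cauchy entries are 1-stable: a . (x - y) is distributed as
    # ||x - y||_1 times a standard Cauchy variable.
    a = rng.standard_cauchy(size=(num_hashes, dim))
    # A random offset per hash function shifts the bucket boundaries.
    b = rng.uniform(0.0, bucket_width, size=num_hashes)

    def h(x):
        x = np.asarray(x, dtype=float)
        # Project, shift, and quantize into buckets of the given width.
        return tuple(np.floor((a @ x + b) / bucket_width).astype(int))

    return h

def collision_rate(h, x, y):
    """Fraction of hash functions on which x and y land in the same bucket."""
    hx, hy = h(x), h(y)
    return sum(u == v for u, v in zip(hx, hy)) / len(hx)

# Assumed toy data: one point, a slightly perturbed copy, and a distant point.
rng = np.random.default_rng(1)
x = rng.normal(size=32)
near = x + 0.01 * rng.normal(size=32)
far = x + 5.0 * rng.normal(size=32)

h = make_l1_lsh(dim=32, num_hashes=2000, bucket_width=4.0)
near_rate = collision_rate(h, x, near)
far_rate = collision_rate(h, x, far)
print(f"collision rate (near): {near_rate:.3f}")
print(f"collision rate (far):  {far_rate:.3f}")
```

An index built on such functions only compares a query against points that share a bucket with it, which is what makes the search sublinear in practice; the bucket width and the number of hash functions trade recall against query cost.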
Date
2025-07-29
Resource Type
Text
Resource Subtype
Dissertation