Fundamental Limits and Algorithms for Database and Graph Alignment

Thumbnail Image
Dai, Osman Emre
Kiyavash, Negar
Singh, Mohit
Associated Organization(s)
Supplementary to
Data alignment refers to a class of problems where given two sets of anonymized data pertaining to overlapping sets of users, the goal is to identify the correspondences between the two sets. If the data of a user is contained in both sets, the correlation between the two data points associated with the user might make it possible to determine that both belong to the same user and hence link the data points. Alignment problems are of practical interest in applications such as privacy and data junction. Data alignment can be used to de-anonymize data, therefore, studying the feasibility of alignment allows for a more reliable understanding of the limitations of anonymization schemes put in place to protect against privacy breaches. Additionally, data alignment can aid in finding the correspondence between data from different sources, e.g. different sensors. The data fusion performed through data alignment in turn can help with variety of inference problems that arise in scientific and engineering applications. This thesis considers two types of data alignment problems: database and graph alignment. Database alignment refers to the setting where each feature (i.e. data points) in a data set is associated with a single user. Graph alignment refers to the setting where data points in each data set are associated with pairs of users. For both problems, we are particularly interested in the asymptotic case where n, the number of users with data in both sets, goes to infinity. Nevertheless our analyses often yield results applicable to the finite n case. To develop a preliminary understanding of the database alignment problem, we first study the closely related problem of planted matching with Gaussian weights of unit variance, and derive tight achievability bounds that match our converse bounds: Specifically we identify different inequalities between log n and the signal strength (which corresponds to the square of the difference between the mean weights of planted and non-planted edges) that guarantee upper bounds on the log of the expected number of errors. Then, we study the database alignment problem with Gaussian features in the low per-feature correlation setting where the number of dimensions of each feature scales as ω(log n): We derive inequalities between log n and signal strength (which, for database alignment, corresponds to the mutual information between correlated features) that guarantee error bounds matching those of the planted matching setting, supporting the claimed connection between the two problems. Then, relaxing the restriction on the number of dimensions of features, we derive conditions on signal strength and dimensionality that guarantee smaller upper bounds on the log of the expected number of errors. The stronger results in the O(log n)-dimensional-feature setting for Gaussian databases show how planted matching, while useful, is not a perfect substitute to understand the dynamics of the more complex problem of database alignment. For graph alignment, we focus on the correlated Erdős–Rényi graph model where the data point (i.e. edge) associated with each pair of users in a graph is a Bernoulli random variable that is correlated with the data point associated with the same pair in the other graph. We study a canonical labeling algorithm for alignment and identify conditions on the density of the graphs and correlation between edges across graphs that guarantees the recovery of the true alignment with high probability.
Date Issued
Resource Type
Resource Subtype
Rights Statement
Rights URI