Title:
Algorithmic Techniques for Variant Selection in Genome Graphs
Author(s)
Tavakoli, Neda
Advisor(s)
Aluru, Srinivas
Editor(s)
Collections
Supplementary to
Permanent Link
Abstract
Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogs of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. This dissertation research takes a holistic approach to develop a novel mathematical framework for variant selection in genome graphs subject to preserving sequences of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants (e.g., SNPs, indels, or structural variants) and whether the goal is to minimize the number of positions at which variants are incorporated or to minimize the total number of variants incorporated. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementations. We also empirically evaluate run-time performance and reduction in number of variants achieved by the multiple algorithms proposed in this dissertation research. The designed mathematical framework aims to address the following prominent problem cases.
First, we develop novel algorithms and complexity results for preserving sequences of length α represented by paths in the complete variation graph, under the Hamming distance and edit distance metrics. We also assess the extent of variant reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to characteristics of short and long-read resequencing technologies. Additionally, we establish benchmark data sets and tools to empirically evaluate the impact of variant selection on read-to-graph mapping.
Next, we consider the problem of preserving α-long substrings of the input haplotypes, the so-called haplotype-aware variant selection problem, under the Hamming distance metric. We show that this problem is NP-hard and develop an Integer Linear Programming (ILP) solution for it. The solution is effective in finding optimal solutions even on human chromosome-scale graphs and for a variety of sequence lengths and error percentage thresholds. In addition to ensuring optimality, our results demonstrate that a substantial additional reduction in the number of selected variants can be achieved when restricting preserved sequences to individual haplotypes.
Finally, we extend the haplotype-aware variant selection problem to the edit distance metric, where the input haplotypes may contain SNPs, indels, and structural variants. We design a two-level ILP formulation for the problem. We implement a software framework for this problem and empirically evaluate its effectiveness in finding optimal solutions on human chromosome-scale graphs and for a variety of sequence lengths and error percentage thresholds. We also experimentally evaluate the impact of the variant reduction obtained here on sequence-to-graph mapping accuracy. Taken together, these formulations and results are expected to significantly advance our knowledge of the creation and use of genome graphs.
Sponsor
Date Issued
2023-08-25
Extent
Resource Type
Text
Resource Subtype
Dissertation