Scalable mapping-free algorithms and K-mer data structures for genomic analysis

Audano, Peter Anthony
Vannberg, Fredrik O.
An organism’s DNA sequence is a virtual cornucopia of information, and sequencing technology is the key to unlocking it. The past few decades have witnessed meteoric growth in these technologies and in the volume of data they produce. New industries are forming around them, existing ones are changing as a result, and modern medicine is on the precipice of a genomic revolution. Turning this deluge of data into useful information is already a challenging task: advances in sequencing technology are far outstripping advances in semiconductors, and this trend shows no signs of stopping. Incremental advances are being made in sequence analysis techniques, but progress is far too slow to keep up with the volume of data delivered by modern sequencing platforms. This gap is often filled by allocating more computing resources in the form of distributed computing platforms, which can quickly become cost-prohibitive. Because medicine requires a quick answer and science often has limited funding, the analysis bottleneck is a major concern. Instead of finding new ways to dedicate more computing resources to the problem, I am interested in streamlining the process. In this dissertation, I explore methods to make the analysis faster and more efficient. My ultimate goal is to create algorithms that can run on a standard computer workstation and, when necessary, make the most of expensive distributed and cloud computing resources. Many analysis pipelines start by aligning sequence reads to a reference or assembling them into long consensus sequences, but it can take several hours to analyze a single sample on a workstation computer. Instead of aligning or assembling sequence reads, the approaches described in this dissertation transform and analyze the sequence data without alignments. These alignment-free approaches often improve performance by an order of magnitude or more.
The first step for many alignment-free approaches involves transforming the sequence reads into k-mers, which are short overlapping fragments of uniform size. If the size, k, is 48, then all substrings of length 48 are extracted from each sequence read and counted. The resulting frequency of each k-mer can then be used as evidence for genomic analysis techniques. When I started as a graduate student, I began by working on a program to transform sequence data to k-mer frequencies. The resulting software, KAnalyze, was incredibly flexible because of its software architecture and approach to solving the k-mer counting problem. This became the foundation for the rest of my graduate work because it was possible to test new ideas that required data structures and transformations not commonly used today. After my initial work with k-mers, I was connected with the Centers for Disease Control and Prevention (CDC) Mycobacterium tuberculosis (M. tuberculosis) science team. They were interested in replacing existing software with an alignment-free approach based on k-mers. The new software significantly reduced analysis time, and it reduced errors. K-mers are more rigid than sequence alignments, so I had to find a way to correct for mutations in the samples. This resulted in a novel algorithm that could identify single nucleotide polymorphism (SNP) and insertion/deletion (indel) mutations, and it still took far less time than the alignment approach. I am not aware of another k-mer error correction algorithm that goes beyond a simple Hamming distance calculation and is therefore capable of handling indel variants. The CDC M. tuberculosis project solved this problem in a naïve way: when a variant was detected, it took four paths and assumed that it might be a SNP, an insertion, a deletion, or no variant. The result is an algorithm that runs with a computational time complexity of O(4^n).
Although this worked well on the short reference sequences analyzed for this project, it would never scale to larger references. After some thought and experimentation, I realized that the reconstruction algorithm I had created was generating a sequence that was ostensibly related to the reference, and comparing two sequences to determine how well they match is the fundamental task of alignments. Instead of taking multiple paths for each base mismatch, could it be possible to guide the rebuilding algorithm by aligning the dynamically constructed sequence with the reference? Although it is not 100% alignment-free, this approach would significantly reduce the computational burden of modern methods, which align each of the sequence reads to the reference. Even more exciting, it might be able to pick out variants that sequence read alignments are blind to, such as dense SNP loci or large insertions. This work culminated in Kestrel, a novel first-in-class variant caller application that uses this idea. The chapters that follow tell this story from the current state of the art to novel applications of this new technology that are under development today.
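
The k-mer transformation described above, extracting every substring of length k from each read and counting it, can be illustrated with a minimal Python sketch. This is not the KAnalyze implementation; the function name count_kmers and the toy reads are assumptions made purely for illustration, and the sketch ignores concerns such as memory use and reverse-complement strands.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping substring of length k across all reads.

    Hypothetical helper for illustration only; not taken from KAnalyze.
    """
    counts = Counter()
    for read in reads:
        # A read of length n yields n - k + 1 overlapping k-mers
        # (and none at all when the read is shorter than k).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy input: two short reads and a small k so the counts are easy to check.
reads = ["ACGTACGT", "CGTACG"]
kmer_counts = count_kmers(reads, k=4)

for kmer, frequency in sorted(kmer_counts.items()):
    print(kmer, frequency)
```

With k = 4, the first read contributes five k-mers and the second contributes three, so the frequencies sum to eight. The same loop applies unchanged for k = 48 on real sequence reads, although a practical counter would need a more memory-efficient representation than Python strings in a dictionary.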