Combinatorial optimization and application to DNA sequence analysis

Thumbnail Image
Gupta, Kapil
Lee, Eva K.
Associated Organizations
Supplementary to
With recent and continuing advances in bioinformatics, the volume of sequence data has increased tremendously. Along with this increase, there is a growing need to develop efficient algorithms to process such data in order to make useful and important discoveries. Careful analysis of genomic data will benefit science and society in numerous ways, including the understanding of protein sequence functions, early detection of diseases, and finding evolutionary relationships that exist among various organisms. Most sequence analysis problems arising from computational genomics and evolutionary biology fall into the class of NP-complete problems. Advances in exact and approximate algorithms to address these problems are critical. In this thesis, we investigate a novel graph theoretical model that deals with fundamental evolutionary problems. The model allows incorporation of the evolutionary operations ``insertion', ``deletion', and ``substitution', and various parameters such as relative distances and weights. By varying appropriate parameters and weights within the model, several important combinatorial problems can be represented, including the weighted supersequence, weighted superstring, and weighted longest common sequence problems. Consequently, our model provides a general computational framework for solving a wide variety of important and difficult biological sequencing problems, including the multiple sequence alignment problem, and the problem of finding an evolutionary ancestor of multiple sequences. In this thesis, we develop large scale combinatorial optimization techniques to solve our graph theoretical model. In particular, we formulate the problem as two distinct but related models: constrained network flow problem and weighted node packing problem. The integer programming models are solved in a branch and bound setting using simultaneous column and row generation. The methodology developed will also be useful to solve large scale integer programming problems arising in other areas such as transportation and logistics.
Date Issued
Resource Type
Resource Subtype
Rights Statement
Rights URI