Title:
Analysis of the subsequence composition of biosequences

dc.contributor.advisor Apostolico, Alberto
dc.contributor.author Cunial, Fabio en_US
dc.contributor.committeeMember Gray, Alexander
dc.contributor.committeeMember Harvey, Steve
dc.contributor.committeeMember Heitsch, Christine
dc.contributor.committeeMember Istrail, Sorin
dc.contributor.department Computing en_US
dc.date.accessioned 2012-09-20T18:12:15Z
dc.date.available 2012-09-20T18:12:15Z
dc.date.issued 2012-05-07 en_US
dc.description.abstract Measuring the amount of information and of shared information in biological strings, as well as relating information to structure, function and evolution, are fundamental computational problems in the post-genomic era. Classical analyses of the information content of biosequences are grounded in Shannon's statistical telecommunication theory, while the recent focus is on suitable specializations of the notions introduced by Kolmogorov, Chaitin and Solomonoff, based on data compression and compositional redundancy. Symmetrically, classical estimates of mutual information based on string editing are currently being supplanted by compositional methods hinged on the distribution of controlled substructures. Current compositional analyses and comparisons of biological strings are almost exclusively limited to short sequences of contiguous solid characters. Comparatively little is known about longer and sparser components, both from the point of view of their effectiveness in measuring information and in separating biological strings from random strings, and from the point of view of their ability to classify and to reconstruct phylogenies. Yet, sparse structures are suspected to grasp long-range correlations and, at short range, they are known to encode signatures and motifs that characterize molecular families. In this thesis, we introduce and study compositional measures based on the repertoire of distinct subsequences of any length, but constrained to occur with a predefined maximum gap between consecutive symbols. Such measures highlight previously unknown laws that relate subsequence abundance to string length and to the allowed gap, across a range of structurally and functionally diverse polypeptides. Measures on subsequences are capable of separating only a few amino acid strings from their random permutations, but they reveal that random permutations themselves amass along previously undetected, linear loci.
This is perhaps the first time in which the vocabulary of all distinct subsequences of a set of structurally and functionally diverse polypeptides is systematically counted and analyzed. Another objective of this thesis is measuring the quality of phylogenies based on the composition of sparse structures. Specifically, we use a set of repetitive gapped patterns, called motifs, whose length and sparsity have never been considered before. We find that extremely sparse motifs in mitochondrial proteomes support phylogenies of comparable quality to state-of-the-art string-based algorithms. Moving from maximal motifs -- motifs that cannot be made more specific without losing support -- to a set of generators with decreasing size and redundancy generally degrades classification, suggesting that redundancy itself is a key factor for the efficient reconstruction of phylogenies. This is perhaps the first time in which the composition of all motifs of a proteome is systematically used in phylogeny reconstruction on a large scale. Extracting all maximal motifs, or even their compact generators, is infeasible for entire genomes. In the last part of this thesis, we study the robustness of measures of similarity built around the dictionary of LZW -- the variant of the LZ78 compression algorithm proposed by Welch -- and of some of its recently introduced gapped variants. These algorithms use a very small vocabulary, they run in time linear in the length of the input strings, and they can be made even faster than LZ77 in practice. We find that dissimilarity measures based on maximal strings in the dictionary of LZW support phylogenies that are comparable to state-of-the-art methods on test proteomes. Introducing a controlled proportion of gaps does not degrade classification, and allows discarding up to 20% of each input proteome during comparison. en_US
dc.description.degree PhD en_US
dc.identifier.uri http://hdl.handle.net/1853/44716
dc.publisher Georgia Institute of Technology en_US
dc.subject Subsequences en_US
dc.subject Compositional complexity en_US
dc.subject Phylogeny reconstruction en_US
dc.subject Alignment-free sequence comparison en_US
dc.subject Sparse motifs en_US
dc.subject LZW en_US
dc.subject LZWA en_US
dc.subject Variance computation en_US
dc.subject Protein domains en_US
dc.subject Proteomes en_US
dc.subject.lcsh Phylogeny
dc.subject.lcsh Polypeptides
dc.subject.lcsh Molecular biology
dc.subject.lcsh Algorithms
dc.title Analysis of the subsequence composition of biosequences en_US
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computational Science and Engineering
local.relation.ispartofseries Doctor of Philosophy with a Major in Computational Science and Engineering
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 01ab2ef1-c6da-49c9-be98-fbd1d840d2b1
relation.isSeriesOfPublication d4852cba-4faa-473e-a20c-cb1b728ef27a
Files
Original bundle
Name:
cunial_fabio_201208_phd.pdf
Size:
2.95 MB
Format:
Adobe Portable Document Format