Organizational Unit:
School of Computational Science and Engineering


Publication Search Results

Now showing 1 - 5 of 5
  • Item
    Detection of frameshifts and improving genome annotation
    (Georgia Institute of Technology, 2012-11-12) Antonov, Ivan Valentinovich
    We developed a new program called GeneTack for ab initio frameshift detection in intronless protein-coding nucleotide sequences. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and a single gene with a frameshift. We tested GeneTack as well as two other earlier developed programs FrameD and FSFind on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. We observed that the average frameshift prediction accuracy of GeneTack, in terms of (Sn+Sp)/2 values, was higher by a significant margin than the accuracy of the other two programs. GeneTack was used to screen 1,106 complete prokaryotic genomes and 206,991 genes with frameshifts (fs-genes) were identified. Our goal was to determine if a frameshift transition was due to (i) a sequencing error, (ii) an indel mutation or (iii) a recoding event. We grouped 102,731 genes with frameshifts (fs-genes) into 19,430 clusters based on sequence similarity between their protein products (fs-proteins), conservation of predicted frameshift position, and its direction. While fs-genes in 2,810 clusters were classified as conserved pseudogenes and fs-genes in 1,200 clusters were classified as hypothetical pseudogenes, 5,632 fs-genes from 239 clusters possessing conserved motifs near frameshifts were predicted to be recoding candidates. Experiments were performed for sequences derived from 20 out of the 239 clusters; programmed ribosomal frameshifting with efficiency higher than 10% was observed for four clusters. GeneTack was also applied to 1,165,799 mRNAs from 100 eukaryotic species and 45,295 frameshifts were identified. A clustering approach similar to the one used for prokaryotic fs-genes allowed us to group 12,103 fs-genes into 4,087 clusters. Known programmed frameshift genes were among the obtained clusters.
Several clusters may correspond to new examples of dual coding genes. We developed a web interface to browse a database containing all the fs-genes predicted by GeneTack in prokaryotic genomes and eukaryotic mRNA sequences. The fs-genes can be retrieved by similarity search to a given query sequence, by fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their likely origin, such as pseudogenization, phase variation, programmed frameshifts, etc. All the tools and the database of fs-genes are available at the GeneTack web site http://topaz.gatech.edu/GeneTack/
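The Viterbi decoding step mentioned in the abstract above can be illustrated with a toy example. The sketch below is not GeneTack's model: the two states ("X", "Y"), the two-letter alphabet, and all probabilities are hypothetical values chosen only to show how the most likely hidden-state path is recovered from an observed sequence.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for obs (log-space Viterbi)."""
    # best[t][s]: best log-probability of any path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model (assumption, not taken from the thesis):
# "sticky" states that rarely switch, each preferring one symbol.
STATES = ["X", "Y"]
LOG_START = {"X": math.log(0.5), "Y": math.log(0.5)}
LOG_TRANS = {
    "X": {"X": math.log(0.9), "Y": math.log(0.1)},
    "Y": {"X": math.log(0.1), "Y": math.log(0.9)},
}
LOG_EMIT = {
    "X": {"a": math.log(0.9), "b": math.log(0.1)},
    "Y": {"a": math.log(0.1), "b": math.log(0.9)},
}
```

Calling `viterbi("aaabbb", STATES, LOG_START, LOG_TRANS, LOG_EMIT)` on this toy model yields a path that switches state once, at the point where the observed symbols change. A frameshift detector would use a far richer state space (for example, states per reading frame), which this sketch does not attempt to reproduce.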
  • Item
    Interest management scheme and prediction model in intelligent transportation systems
    (Georgia Institute of Technology, 2012-10-12) Li, Ying
    This thesis focuses on two important problems related to dynamic data-driven application systems (DDDAS): interest management (data distribution) and prediction models. In order to reduce communication overhead, we propose a new interest management mechanism for mobile peer-to-peer systems. This approach involves dividing the entire space into cells and using an efficient sorting algorithm to sort the regions in each cell. A mobile landmarking scheme is introduced to implement this sort-based scheme in mobile peer-to-peer systems. The design does not require a centralized server, but rather, every peer can become a mobile landmark node to take a server-like role to sort and match the regions. Experimental results show that the scheme has better computational efficiency for both static and dynamic matching. In order to improve communication efficiency, we present a travel time prediction model based on boosting, an important machine learning technique, and combine boosting and neural network models to increase prediction accuracy. We also explore the relationship between the accuracy of travel time prediction and the frequency of traffic data collection with the long term goal of minimizing bandwidth consumption. Several different sets of experiments are used to evaluate the effectiveness of this model. The results show that the boosting neural network model outperforms other predictors.
  • Item
    Personalized search and recommendation for health information resources
    (Georgia Institute of Technology, 2012-08-24) Crain, Steven P.
    Consumers face several challenges using the Internet to fill health-related needs. (1) In many cases, they face a language gap as they look for information that is written in unfamiliar technical language. (2) Medical information in social media is of variable quality and may be appealing even when it is dangerous. (3) Discussion groups provide valuable social support for necessary lifestyle changes, but are variable in their levels of activity. (4) Finding less popular groups is tedious. We present solutions to these challenges. We use a novel adaptation of topic models to address the language gap. Conventional topic models discover a set of unrelated topics that together explain the combinations of words in a collection of documents. We add additional structure that provides relationships between topics corresponding to relationships between consumer and technical medical topics. This allows us to support search for technical information using informal consumer medical questions. We also analyze social media related to eating disorders. A third of these videos promote eating disorders and consumers are twice as engaged by these dangerous videos. We study the interactions of two communities in a photo-sharing site. There, a community that encourages recovery from eating disorders interacts with the pro-eating disorder community in an attempt to persuade them, but we found that this attempt entrenches the pro-eating disorder community more firmly in its position. We study the process by which consumers participate in discussion groups in an online diabetes community. We develop novel event history analysis techniques to identify the characteristics of groups in a diabetes community that are correlated with consumer activity. This analysis reveals that uniformly advertising the popular groups to all consumers impairs the diversity of the groups and limits their value to the community.
To help consumers find interesting discussion groups, we develop a system for personalized recommendation for social connections. We extend matrix factorization techniques that are effective for product recommendation so that they become suitable for implicit power-law-distributed social ratings. We identify the best approaches for recommendation of a variety of social connections involving consumers, discussion groups and discussions.
  • Item
    Analysis of the subsequence composition of biosequences
    (Georgia Institute of Technology, 2012-05-07) Cunial, Fabio
    Measuring the amount of information and of shared information in biological strings, as well as relating information to structure, function and evolution, are fundamental computational problems in the post-genomic era. Classical analyses of the information content of biosequences are grounded in Shannon's statistical telecommunication theory, while the recent focus is on suitable specializations of the notions introduced by Kolmogorov, Chaitin and Solomonoff, based on data compression and compositional redundancy. Symmetrically, classical estimates of mutual information based on string editing are currently being supplanted by compositional methods hinged on the distribution of controlled substructures. Current compositional analyses and comparisons of biological strings are almost exclusively limited to short sequences of contiguous solid characters. Comparatively little is known about longer and sparser components, both from the point of view of their effectiveness in measuring information and in separating biological strings from random strings, and from the point of view of their ability to classify and to reconstruct phylogenies. Yet, sparse structures are suspected to grasp long-range correlations and, at short range, they are known to encode signatures and motifs that characterize molecular families. In this thesis, we introduce and study compositional measures based on the repertoire of distinct subsequences of any length, but constrained to occur with a predefined maximum gap between consecutive symbols. Such measures highlight previously unknown laws that relate subsequence abundance to string length and to the allowed gap, across a range of structurally and functionally diverse polypeptides. Measures on subsequences are capable of separating only few amino acid strings from their random permutations, but they reveal that random permutations themselves amass along previously undetected, linear loci. 
This is perhaps the first time in which the vocabulary of all distinct subsequences of a set of structurally and functionally diverse polypeptides is systematically counted and analyzed. Another objective of this thesis is measuring the quality of phylogenies based on the composition of sparse structures. Specifically, we use a set of repetitive gapped patterns, called motifs, whose length and sparsity have never been considered before. We find that extremely sparse motifs in mitochondrial proteomes support phylogenies of comparable quality to state-of-the-art string-based algorithms. Moving from maximal motifs -- motifs that cannot be made more specific without losing support -- to a set of generators with decreasing size and redundancy generally degrades classification, suggesting that redundancy itself is a key factor for the efficient reconstruction of phylogenies. This is perhaps the first time in which the composition of all motifs of a proteome is systematically used in phylogeny reconstruction on a large scale. Extracting all maximal motifs, or even their compact generators, is infeasible for entire genomes. In the last part of this thesis, we study the robustness of measures of similarity built around the dictionary of LZW -- the variant of the LZ78 compression algorithm proposed by Welch -- and of some of its recently introduced gapped variants. These algorithms use a very small vocabulary, they run in time linear in the length of the input strings, and they can be made even faster than LZ77 in practice. We find that dissimilarity measures based on maximal strings in the dictionary of LZW support phylogenies that are comparable to state-of-the-art methods on test proteomes. Introducing a controlled proportion of gaps does not degrade classification, and allows discarding up to 20% of each input proteome during comparison.
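The idea of comparing strings through their LZW dictionaries, described in the abstract above, can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the thesis's actual measure (which is based on maximal strings in the dictionary), so the sketch swaps in a plain Jaccard dissimilarity between the sets of dictionary phrases; the function names `lzw_phrases` and `jaccard_dissimilarity` are hypothetical.

```python
def lzw_phrases(text):
    """Return the multi-character phrases that LZW adds to its dictionary while parsing text."""
    dictionary = set(text)  # LZW starts with every single symbol already in the dictionary
    phrases = set()
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the current match
            current = candidate
        else:
            # Longest match ended: record the new phrase and restart from ch
            dictionary.add(candidate)
            phrases.add(candidate)
            current = ch
    return phrases

def jaccard_dissimilarity(a, b):
    """Jaccard distance between the LZW phrase sets of a and b (stand-in measure, an assumption)."""
    pa, pb = lzw_phrases(a), lzw_phrases(b)
    union = pa | pb
    if not union:
        return 0.0  # neither string produced any phrase
    return 1.0 - len(pa & pb) / len(union)
```

For example, `lzw_phrases("abababab")` collects the phrases `ab`, `ba`, `aba`, and `abab`. Each input is parsed in a single left-to-right pass, which is what makes dictionary-based comparisons of this kind attractive for long sequences.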
  • Item
    Parallel algorithms for direct blood flow simulations
    (Georgia Institute of Technology, 2012-02-21) Rahimian, Abtin
    Fluid mechanics of blood can be well approximated by a mixture model of a Newtonian fluid and deformable particles representing the red blood cells. Experimental and theoretical evidence suggests that the deformation and rheology of red blood cells are similar to those of phospholipid vesicles. Vesicles and red blood cells are both area-preserving closed membranes that resist bending. Beyond red blood cells, vesicles can be used to investigate the behavior of cell membranes, intracellular organelles, and viral particles. Given the importance of vesicle flows, in this thesis we focus on efficient numerical methods for such problems: we present computationally scalable algorithms for the simulation of dilute suspensions of deformable vesicles in two and three dimensions. Our method is based on the boundary integral formulation of Stokes flow. We present new schemes for simulating the three-dimensional hydrodynamic interactions of a large number of vesicles with viscosity contrast. The algorithms incorporate a stable time-stepping scheme, high-order spatiotemporal discretizations, spectral preconditioners, and a reparametrization scheme capable of resolving extreme mesh distortions in dynamic simulations. The associated linear systems are solved in optimal time using spectral preconditioners. The highlights of our numerical scheme are that (i) the physics of vesicles is faithfully represented by using nonlinear solid mechanics to capture the deformations of each cell, (ii) the long-range, N-body, hydrodynamic interactions between vesicles are accurately resolved using the fast multipole method (FMM), and (iii) our time-stepping scheme is unconditionally stable for the flow of single and multiple vesicles with viscosity contrast and its computational cost-per-simulation-unit-time is comparable to or less than that of an explicit scheme. We report scaling of our algorithms to simulations with millions of vesicles on thousands of computational cores.