Organizational Unit:
School of Computational Science and Engineering

Publication Search Results

Now showing 1 - 10 of 183
  • Item
    Distributed memory building blocks for massive biological sequence analysis
    (Georgia Institute of Technology, 2018-04-03) Pan, Tony C. ; Bader, David A. ; Aluru, Srinivas ; Catalyurek, Umit ; Vuduc, Richard ; Jordan, King ; Vannberg, Fredrick ; College of Computing ; School of Computational Science and Engineering ; Computational Science and Engineering
    K-mer indices and de Bruijn graphs are important data structures in bioinformatics with multiple applications ranging from foundational tasks such as error correction, alignment, and genome assembly, to knowledge discovery tasks including repeat detection and SNP identification. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the data sets at the current generation rate of 1.8 terabases every 3 days. The volume and velocity with which sequencing data is generated necessitate efficient algorithms and implementations of k-mer indices and de Bruijn graphs, two central components in bioinformatic applications. Existing applications that utilize k-mer counting and de Bruijn graphs, however, tend to provide embedded, specialized implementations. The research presented here represents efforts toward the creation of the first reusable, flexible, and extensible distributed memory parallel libraries for k-mer indexing and de Bruijn graphs. These libraries are intended to simplify the development of bioinformatics applications for distributed memory environments. For each library, our goals are to create a set of APIs that are simple to use, and to provide optimized implementations based on efficient parallel algorithms. We designed algorithms that minimize communication volume and latency, and developed implementations with better cache utilization and SIMD vectorization. We developed Kmerind, a k-mer counting and indexing library based on distributed memory hash tables and distributed sorted arrays, that provides efficient insert, find, count, and erase operations. For de Bruijn graphs, we developed Bruno by leveraging Kmerind functionalities to support parallel de Bruijn graph construction, chain compaction, error removal, and graph traversal and element query. Our performance evaluations showed that Kmerind is scalable and high performance.
Kmerind counted k-mers in a 120GB data set in less than 13 seconds on 1024 cores, and indexed the k-mer positions in 17 seconds. Using the Cori supercomputer and incorporating architecture aware optimizations as well as MPI-OpenMP hybrid computation and overlapped communication, Kmerind was able to count a 350GB data set in 4.1 seconds using 4096 cores. Kmerind has been shown to out-perform the state-of-the-art k-mer counting tools at 32 to 64 cores on a shared memory system. The Bruno library is built on Kmerind and implements efficient algorithms for construction, compaction, and error removal. It is capable of constructing, compacting, and generating unitigs for a 694GB human read data set in 7.3 seconds on 7680 Edison cores. It is 1.4X and 3.7X faster than its state-of-the-art alternatives in shared and distributed memory environments, respectively. Error removal in a graph constructed from a 162GB data set completed in 13.1 and 3.91 seconds with frequency filters of 2 and 4, respectively, on 16 nodes totaling 512 cores. While our target domain is bioinformatics, we approached algorithm design and implementation with the aim of broader applicability in computer science and other application domains. As a result, our chain compaction and cycle detection algorithms can feasibly be applied to general graphs, and our distributed and sequential cache friendly hash tables as well as vectorized hash functions are generic and application neutral.
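As a rough illustration of the k-mer counting and frequency filtering described in this abstract, the following single-process Python sketch counts k-mers in a few reads and shows one way a k-mer could be assigned to an owning rank in a distributed hash table. It is not the Kmerind or Bruno API. The function names (`count_kmers`, `owner_rank`, `frequency_filter`), the 2-bit encoding, and the "keep counts at or above the threshold" rule are assumptions made for this example.

```python
from collections import Counter

# Assumed 2-bit encoding of nucleotides, used only for this illustration.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(reads, k):
    """Count every length-k substring of each read, skipping k-mers
    that contain characters outside A, C, G, T (for example N)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if all(ch in CODE for ch in kmer):
                counts[kmer] += 1
    return counts


def owner_rank(kmer, num_ranks):
    """Map a k-mer to a rank by hashing its 2-bit encoding.
    A distributed hash table needs some deterministic mapping like this;
    the specific hash here is hypothetical."""
    value = 0
    for ch in kmer:
        value = value * 4 + CODE[ch]
    return value % num_ranks


def frequency_filter(counts, min_count):
    """Keep k-mers seen at least min_count times (assumed semantics)."""
    return {kmer: c for kmer, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    counts = count_kmers(["ACGTACG", "ACNGT"], k=3)
    print(dict(counts))
    print(frequency_filter(counts, 2))
    print(owner_rank("ACG", 4))
```

In a real distributed setting each rank would count only the k-mers it owns after an all-to-all exchange; the sketch keeps everything in one process to show the data flow only.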
  • Item
    Learning from Multi-Source Weak Supervision for Neural Text Classification
    (Georgia Institute of Technology, 2020-07-28) Ren, Wendi ; Zhang, Chao ; Zhao, Tuo ; Navathe, Shamkant B. ; College of Computing ; School of Computational Science and Engineering ; Computational Science and Engineering
    Text classification is a fundamental text mining task with numerous real-life applications. While deep neural nets have achieved superior performance for text classification, they rely on large-scale labeled data to achieve strong performance. Obtaining large-scale labeled data, however, can be prohibitively expensive in many applications. In this project, we study the problem of learning neural text classifiers without using any labeled data, but only easy-to-provide heuristic rules as weak supervision. This problem is challenging because rule-induced weak labels are often noisy and incomplete. To address these challenges, we propose a model that can be learned from multiple weak supervision sources with two key components. The first component is a rule denoiser, which estimates conditional source reliability using a soft attention mechanism and reduces label noise by aggregating rule-induced noisy data. The second is a neural classifier that predicts soft labels for unmatchable samples to address the rule coverage issue. The two components are integrated into a co-training framework, which can be trained end-to-end to mutually enhance each other. We evaluate our model on five benchmarks for four popular text classification tasks, including sentiment analysis, topic classification, spam classification, and relation extraction. The results show that our model outperforms state-of-the-art weakly-supervised and semi-supervised methods, and achieves comparable performance with fully-supervised methods even without any labeled data.
  • Item
    MMAP: Mining Billion-Scale Graphs on a PC with Fast, Minimalist Approach via Memory Mapping
    (Georgia Institute of Technology, 2013) Sabrin, Kaeser Md. ; Lin, Zhiyuan ; Chau, Duen Horng ; Lee, Ho ; Kang, U. ; College of Computing ; School of Computational Science and Engineering ; Korea Advanced Institute of Science and Technology. Dept. of Computer Science
    Large graphs with billions of nodes and edges are increasingly common, calling for new kinds of scalable computation frameworks. State-of-the-art approaches such as GraphChi and TurboGraph recently demonstrated that a single PC can efficiently perform advanced computation on billion-node graphs. Although fast, they use sophisticated data structures, explicit memory management, and optimization techniques to achieve high speed and scalability. We propose a minimalist approach that forgoes such complexities by leveraging the fundamental memory mapping (MMap) capability found on operating systems. We present multiple major findings; we contribute: (1) our crucial insight that MMap can be a viable technique for creating fast, scalable graph algorithms that surpass some of the best techniques; (2) a counterintuitive result that we can do less and gain more: MMap enables us to use a much simpler data structure (edge list) and algorithm design, and to defer memory management to the OS, while offering performance significantly faster than or comparable to that of highly optimized methods (e.g., 10X as fast as GraphChi PageRank on the 1.47 billion edge Twitter graph); (3) extensive experiments on real and synthetic graphs, including the 6.6 billion edge YahooWeb graph, which show that MMap’s benefits are sustained in most conditions. We hope this work will inspire others to explore how memory mapping may help improve other methods or algorithms to further increase their speed and scalability.
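The idea of running a graph algorithm directly over a memory-mapped edge list can be sketched in a few lines of Python. This is only a small illustration under stated assumptions, not the authors' implementation: the binary record layout (two little-endian 32-bit integers per edge), the function names `write_edge_list` and `pagerank_mmap`, and the simple treatment of dangling nodes (their rank mass is dropped) are all choices made for this example.

```python
import mmap
import os
import struct
import tempfile

# Assumed on-disk record: (source, target) as two little-endian uint32 values.
EDGE = struct.Struct("<II")


def write_edge_list(path, edges):
    """Write edges as fixed-size binary records."""
    with open(path, "wb") as f:
        for src, dst in edges:
            f.write(EDGE.pack(src, dst))


def pagerank_mmap(path, num_nodes, iterations=20, damping=0.85):
    """Power-iteration PageRank that scans a memory-mapped edge list.
    The OS pages the file in and out; the code never loads it explicitly.
    The file must be non-empty, since an empty file cannot be mapped."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        num_edges = len(mm) // EDGE.size

        # First pass: out-degree of every source node.
        out_degree = [0] * num_nodes
        for i in range(num_edges):
            src, _ = EDGE.unpack_from(mm, i * EDGE.size)
            out_degree[src] += 1

        rank = [1.0 / num_nodes] * num_nodes
        for _ in range(iterations):
            nxt = [(1.0 - damping) / num_nodes] * num_nodes
            # One sequential scan of the edge list per iteration.
            for i in range(num_edges):
                src, dst = EDGE.unpack_from(mm, i * EDGE.size)
                nxt[dst] += damping * rank[src] / out_degree[src]
            rank = nxt
        return rank


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        edge_file = os.path.join(tmp, "edges.bin")
        write_edge_list(edge_file, [(0, 1), (1, 2), (2, 0)])
        print(pagerank_mmap(edge_file, num_nodes=3))
```

The per-iteration access pattern is a sequential scan of the file, which is the kind of access that operating-system paging tends to handle well.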
  • Item
    Intelligent hazard identification: Dynamic visibility measurement of construction equipment operators
    (Georgia Institute of Technology, 2014-03-26) Ray, Soumitry J. ; DesRoches, Reginald ; Chau, Duen Horng (Polo) ; Vela, Patricio A. ; Cho, Yong K. ; Narasimha, Rajesh ; School of Computational Science and Engineering ; Computational Science and Engineering
    Struck-by fatalities involving heavy equipment such as trucks and cranes accounted for 24.6% of the fatalities between 1997 and 2007 in the construction industry. Limited visibility due to blind spots and travel in the reverse direction are the primary causes of these fatalities. Blind spots are spaces surrounding a piece of equipment that are invisible to the equipment operator. Thus, a hazard is posed to the ground personnel working in the blind spaces of an equipment operator. This research presents a novel approach to intelligently identify potential hazards posed to workers operating near a piece of equipment by determining the visible and blind space regions of an equipment operator in real time. A depth camera is used to estimate the head posture of the equipment operator and continuously track the head location and orientation using the Random Forests algorithm. The head posture information is then integrated with point cloud data of the construction equipment to determine both the visible and the blind spot regions of the equipment operator using a ray-casting algorithm. Simulation and field experiments were carried out to validate this approach in controlled and uncontrolled environments, respectively. Research findings demonstrate the potential of this approach to enhance safety performance by detecting hazardous proximity situations.
  • Item
    Extending Hadoop to Support Binary-Input Applications
    (Georgia Institute of Technology, 2012-10-19) Hong, Bo ; College of Computing ; School of Computational Science and Engineering
    Many data-intensive applications naturally take multiple inputs, which is not well supported by some popular MapReduce implementations, such as Hadoop. In this talk, we present an extension of Hadoop to better support such applications. The extension is expected to provide the following benefits: (1) easier programming of such applications, (2) better exploitation of data locality than native Hadoop, and (3) improved application performance.
  • Item
    Nonnegative matrix factorization for clustering
    (Georgia Institute of Technology, 2014-07-01) Kuang, Da ; Park, Haesun ; Chau, Duen Horng (Polo) ; Saltz, Joel ; Vuduc, Richard ; Zhou, Hao-Min ; College of Computing ; School of Computational Science and Engineering ; Computational Science and Engineering
    This dissertation shows that nonnegative matrix factorization (NMF) can be extended to a general and efficient clustering method. Clustering is one of the fundamental tasks in machine learning. It is useful for unsupervised knowledge discovery in a variety of applications such as text mining and genomic analysis. NMF is a dimension reduction method that approximates a nonnegative matrix by the product of two lower rank nonnegative matrices, and has shown great promise as a clustering method when a data set is represented as a nonnegative data matrix. However, challenges in the widespread use of NMF as a clustering method lie in its correctness and efficiency: First, we need to know why and when NMF could detect the true clusters and guarantee to deliver good clustering quality; second, existing algorithms for computing NMF are expensive and often take longer time than other clustering methods. We show that the original NMF can be improved from both aspects in the context of clustering. Our new NMF-based clustering methods can achieve better clustering quality and run orders of magnitude faster than the original NMF and other clustering methods. Like other clustering methods, NMF places an implicit assumption on the cluster structure. Thus, the success of NMF as a clustering method depends on whether the representation of data in a vector space satisfies that assumption. Our approach to extending the original NMF to a general clustering method is to switch from the vector space representation of data points to a graph representation. The new formulation, called Symmetric NMF, takes a pairwise similarity matrix as an input and can be viewed as a graph clustering method. We evaluate this method on document clustering and image segmentation problems and find that it achieves better clustering accuracy. In addition, for the original NMF, it is difficult but important to choose the right number of clusters. 
We show that the widely-used consensus NMF in genomic analysis for choosing the number of clusters has critical flaws and can produce misleading results. We propose a variation of the prediction strength measure arising from statistical inference to evaluate the stability of clusters and select the right number of clusters. Our measure shows promising performance in artificial simulation experiments. Large-scale applications bring substantial efficiency challenges to existing algorithms for computing NMF. An important example is topic modeling where users want to uncover the major themes in a large text collection. Our strategy of accelerating NMF-based clustering is to design algorithms that better suit the computer architecture as well as exploit the computing power of parallel platforms such as graphics processing units (GPUs). A key observation is that applying rank-2 NMF that partitions a data set into two clusters in a recursive manner is much faster than applying the original NMF to obtain a flat clustering. We take advantage of a special property of rank-2 NMF and design an algorithm that runs faster than existing algorithms due to continuous memory access. Combined with a criterion to stop the recursion, our hierarchical clustering algorithm runs significantly faster and achieves even better clustering quality than existing methods. Another bottleneck of NMF algorithms, which is also a common bottleneck in many other machine learning applications, is to multiply a large sparse data matrix with a tall-and-skinny dense matrix. We use the GPUs to accelerate this routine for sparse matrices with an irregular sparsity structure. Overall, our algorithm shows significant improvement over popular topic modeling methods such as latent Dirichlet allocation, and runs more than 100 times faster on data sets with millions of documents.
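For readers unfamiliar with the two formulations contrasted in this abstract, they are commonly written as the optimization problems below. The notation (X, W, H, A, k) is an assumption chosen for this note and may differ from the dissertation's own symbols.

```latex
% Standard NMF: approximate a nonnegative data matrix X (m x n)
% by the product of two nonnegative factors of rank k.
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2,
\qquad W \in \mathbb{R}_{+}^{m \times k},\quad H \in \mathbb{R}_{+}^{k \times n}

% Symmetric NMF: factor a pairwise similarity matrix A (n x n)
% with a single nonnegative factor H (n x k).
\min_{H \ge 0} \; \lVert A - H H^{\mathsf{T}} \rVert_F^2,
\qquad H \in \mathbb{R}_{+}^{n \times k}
```

In both cases a typical reading is that each data point is assigned to the cluster whose index has the largest entry in that point's column (standard NMF) or row (Symmetric NMF) of H.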
  • Item
    Detection of frameshifts and improving genome annotation
    (Georgia Institute of Technology, 2012-11-12) Antonov, Ivan Valentinovich ; Borodovsky, Mark ; Baranov, Pavel ; Hammer, Brian ; Jordan, King ; Konstantinidis, Kostas ; Song, Le ; College of Computing ; School of Computational Science and Engineering ; Computational Science and Engineering
    We developed a new program called GeneTack for ab initio frameshift detection in intronless protein-coding nucleotide sequences. The GeneTack program uses a hidden Markov model (HMM) of a genomic sequence with possibly frameshifted protein-coding regions. The Viterbi algorithm finds the maximum likelihood path that discriminates between true adjacent genes and a single gene with a frameshift. We tested GeneTack as well as two other earlier developed programs, FrameD and FSFind, on 17 prokaryotic genomes with frameshifts introduced randomly into known genes. We observed that the average frameshift prediction accuracy of GeneTack, in terms of (Sn+Sp)/2 values, was higher by a significant margin than the accuracy of the other two programs. GeneTack was used to screen 1,106 complete prokaryotic genomes and 206,991 genes with frameshifts (fs-genes) were identified. Our goal was to determine if a frameshift transition was due to (i) a sequencing error, (ii) an indel mutation or (iii) a recoding event. We grouped 102,731 genes with frameshifts (fs-genes) into 19,430 clusters based on sequence similarity between their protein products (fs-proteins), conservation of predicted frameshift position, and its direction. While fs-genes in 2,810 clusters were classified as conserved pseudogenes and fs-genes in 1,200 clusters were classified as hypothetical pseudogenes, 5,632 fs-genes from 239 clusters possessing conserved motifs near frameshifts were predicted to be recoding candidates. Experiments were performed for sequences derived from 20 out of the 239 clusters; programmed ribosomal frameshifting with efficiency higher than 10% was observed for four clusters. GeneTack was also applied to 1,165,799 mRNAs from 100 eukaryotic species and 45,295 frameshifts were identified. A clustering approach similar to the one used for prokaryotic fs-genes allowed us to group 12,103 fs-genes into 4,087 clusters. Known programmed frameshift genes were among the obtained clusters.
Several clusters may correspond to new examples of dual coding genes. We developed a web interface to browse a database containing all the fs-genes predicted by GeneTack in prokaryotic genomes and eukaryotic mRNA sequences. The fs-genes can be retrieved by similarity search to a given query sequence, by fs-gene cluster browsing, etc. Clusters of fs-genes are characterized with respect to their likely origin, such as pseudogenization, phase variation, programmed frameshifts, etc. All the tools and the database of fs-genes are available at the GeneTack web site.
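The Viterbi decoding step mentioned in this abstract can be illustrated with a generic textbook implementation in Python. This is not GeneTack's model: the two hidden states ("A", "B"), the three observation symbols, and every probability below are made-up values for a toy example, and the helper names `to_log` and `viterbi` are hypothetical.

```python
import math


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {
        key: to_log(value) if isinstance(value, dict) else math.log(value)
        for key, value in table.items()
    }


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for a non-empty
    observation sequence, working in log space to avoid underflow."""
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    back_pointers = []

    for obs in observations[1:]:
        prev_best = best
        best = {}
        pointers = {}
        for s in states:
            # Choose the predecessor that maximizes the path score into s.
            p = max(states, key=lambda q: prev_best[q] + log_trans[q][s])
            best[s] = prev_best[p] + log_trans[p][s] + log_emit[s][obs]
            pointers[s] = p
        back_pointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    states = ["A", "B"]
    start = to_log({"A": 0.6, "B": 0.4})
    trans = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
    emit = to_log({
        "A": {"x": 0.5, "y": 0.4, "z": 0.1},
        "B": {"x": 0.1, "y": 0.3, "z": 0.6},
    })
    print(viterbi(["x", "y", "z"], states, start, trans, emit))
```

A frameshift-aware gene model would use many more states (for example one per coding frame), but the dynamic program and traceback have the same shape.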
  • Item
    Prokaryotic Gene Start Prediction: Algorithms for Genomes and Metagenomes
    (Georgia Institute of Technology, 2020-12-01) Gemayel, Karl ; Borodovsky, Mark ; Catalyurek, Umit ; Chau, Duen Horng ; Qiu, Peng ; Jordan, King ; College of Computing ; School of Computational Science and Engineering ; Computational Science and Engineering
    Prokaryotic gene-prediction is the task of finding genes in archaeal or bacterial DNA sequences. These genomes consist of alternating gene-coding and non-coding regions, meaning the task is solved by determining the start and end points of each gene in the DNA sequence, with gene-start prediction generally considered to be more difficult. The primary focus of this work is to improve gene-start prediction accuracy and our understanding of the biological translation-initiation mechanisms used to mark and determine gene-starts. There are two challenges that characterize this task. First, ground-truth, experimentally verified gene-starts are only available for a very small set of genes, and second, our knowledge of translation-initiation mechanisms is incomplete and quite often misleading. Three motivating questions arise from these challenges and are addressed in this work. First, how can we predict gene-starts in a DNA sequence without relying on ground-truth data and without any prior biological knowledge of that species? I show how simplifying assumptions about translation-initiation mechanisms biased the design of existing gene-finder algorithms hindering their predictive performance. I present GeneMarkS-2, an algorithm that relaxes those assumptions and learns more accurate representations of these mechanisms, thereby achieving more accurate predictions. Using it, I provide an updated view of the diversity of translation-initiation mechanisms across the prokaryotic domain. GeneMarkS-2 is now used by the National Center for Biotechnology Information (NCBI) to annotate their database of more than two hundred thousand prokaryotic genomes. Second, how can we measure the accuracy of gene-start prediction without access to ground-truth data? I show that the accuracy of existing methods measured on the limited set of verified data does not generalize to the much larger and more diverse set of available genes. 
This proves that these benchmark sets of verified starts are not representative enough for this task. I describe an alternative method to boost prediction performance for genes outside the ground-truth set by effectively filtering low-certainty predictions. This is done by only selecting gene-start predictions that are corroborated by multiple, independent sources of evidence. As part of this approach, I propose StartLink, a new comparative genomics approach for gene-start prediction; that is, comparing DNA fragments from multiple species rather than relying solely on a single genome. Third, how can we predict gene-starts for metagenomes, i.e., cases where frequently only part of the DNA sequence is available? Here, I describe how the mechanisms for gene-start prediction developed for GeneMarkS-2 can be ported to metagenomes, which often have short DNA fragments that hinder the performance of predictive methods. I present MetaGeneMarkS, and show that it achieves accuracies on metagenomes close to those achieved by GeneMarkS-2 on fully-sequenced DNA. Several recurring themes appear throughout this work. Understanding the limits of our knowledge of translation-initiation mechanisms proves essential to designing better models and provides an open field of new exploration of the diversity of these mechanisms. Furthermore, our unhealthy dependence on verified gene-starts for measuring performance has prevented and continues to prevent us from accurately portraying the quality of our predictors, despite the >95% average accuracy levels measured on this set. It is therefore critical to restate that gene-start prediction is still an open problem.
  • Item
    Interactive Visual Text Analytics
    (Georgia Institute of Technology, 2020-12-07) Kim, Hannah ; Park, Haesun ; Endert, Alex ; Chau, Duen Horng ; Zhang, Chao ; Cao, Nan ; College of Computing ; School of Computational Science and Engineering ; Computational Science and Engineering
    Human-in-the-Loop machine learning leverages both human and machine intelligence to build a smarter model. Even with the advances in machine learning techniques, results generated by automated models can be of poor quality or may not match users' judgment or context. To this end, keeping humans in the loop via the right interfaces to steer the underlying model can be highly beneficial. Prior research in machine learning and visual analytics has focused on either improving model performance or developing interactive interfaces without carefully considering the other side. In this dissertation, we design and develop interactive systems that tightly integrate algorithms, visualizations, and user interactions, focusing on improving interactivity, scalability, and interpretability of the underlying models. Specifically, we present three visual analytics systems to explore and interact with large-scale text data. First, we present interactive hierarchical topic modeling for multi-scale analysis of large-scale documents. Second, we introduce interactive search space reduction to discover a relevant subset of documents with high recall for focused analyses. Lastly, we propose interactive exploration and debiasing of word embeddings.
  • Item
    Enabling collaborative behaviors among cubesats
    (Georgia Institute of Technology, 2011-07-08) Browne, Daniel C. ; Russell, Ryan P. ; Bishop, Carlee ; Vuduc, Richard ; West, Michael ; College of Computing ; School of Computational Science and Engineering ; Computational Science and Engineering
    Future spacecraft missions are trending towards the use of distributed systems or fractionated spacecraft. Initiatives such as DARPA's System F6 are encouraging the satellite community to explore the realm of collaborative spacecraft teams in order to achieve lower cost, lower risk, and greater data value over the conventional monoliths in LEO today. Extensive research has been and is being conducted indicating the advantages of distributed spacecraft systems in terms of both capability and cost. Enabling collaborative behaviors among teams or formations of pico-satellites requires technology development in several subsystem areas including attitude determination and control subsystems, orbit determination and maintenance capabilities, as well as a means to maintain accurate knowledge of team members' position and attitude. All of these technology developments desire improvements (more specifically, decreases) in mass and power requirements in order to fit on pico-satellite platforms such as the CubeSat. In this thesis a solution for the last technology development area aforementioned is presented. Accurate knowledge of each spacecraft's state in a formation, beyond improving collision avoidance, provides a means to best schedule sensor data gathering, thereby increasing power budget efficiency. Our solution is composed of multiple software and hardware components. First, finely-tuned flight system software for the maintaining of state knowledge through equations of motion propagation is developed. Additional software, including an extended Kalman filter implementation, and commercially available hardware components provide a means for on-board determination of both orbit and attitude. Lastly, an inter-satellite communication message structure and protocol enable the updating of position and attitude, as required, among team members. This messaging structure additionally provides a means for payload sensor and telemetry data sharing. 
In order to satisfy the needs of many different missions, the software has the flexibility to vary the limits of accuracy on the knowledge of team member position, velocity, and attitude. Such flexibility provides power savings for simpler applications while still enabling missions that need finer-accuracy knowledge of the distributed team's state. Simulation results are presented indicating the accuracy and efficiency of formation structure knowledge through incorporation of the described solution. More importantly, results indicate the collaborative module's ability to maintain formation knowledge within bounds prescribed by a user. Simulation has included hardware-in-the-loop setups utilizing an S-band transceiver. Two "satellites" (computers set up with S-band transceivers and running the software components of the collaborative module) are provided GPS inputs comparable to the outputs provided from commercial hardware; this partial hardware-in-the-loop setup demonstrates the overall capabilities of the collaborative module. Details on each component of the module are provided. Although the module is designed with the 3U CubeSat framework as the initial demonstration platform, it is easily extendable onto other small satellite platforms. By using this collaborative module as a base, future work can build upon it with attitude control, orbit or formation control, and additional capabilities with the end goal of achieving autonomous clusters of small spacecraft.