Organizational Unit:
School of Computational Science and Engineering

Publication Search Results

Now showing 1 - 10 of 88
  • Item
    AI-infused security: Robust defense by bridging theory and practice
    (Georgia Institute of Technology, 2019-09-20) Chen, Shang-Tse ; Chau, Duen Horng ; Balcan, Maria-Florina ; Lee, Wenke ; Song, Le ; Roundy, Kevin A. ; Cornelius, Cory ; Computational Science and Engineering
    While Artificial Intelligence (AI) has tremendous potential as a defense against real-world cybersecurity threats, understanding the capabilities and robustness of AI remains a fundamental challenge. This dissertation tackles problems essential to successful deployment of AI in security settings and comprises the following three interrelated research thrusts. (1) Adversarial Attack and Defense of Deep Neural Networks: We discover vulnerabilities of deep neural networks in real-world settings and the countermeasures to mitigate the threat. We develop ShapeShifter, the first targeted physical adversarial attack that fools state-of-the-art object detectors. For defenses, we develop SHIELD, an efficient defense leveraging stochastic image compression, and UnMask, a knowledge-based adversarial detection and defense framework. (2) Theoretically Principled Defense via Game Theory and ML: We develop new theories that guide defense resource allocation to guard against unexpected attacks and catastrophic events, using a novel online decision-making framework that compels players to employ "diversified" mixed strategies. Furthermore, by leveraging the deep connection between game theory and boosting, we develop a communication-efficient distributed boosting algorithm with strong theoretical guarantees in the agnostic learning setting. (3) Using AI to Protect Enterprise and Society: We show how AI can be used in a real enterprise environment with a novel framework called Virtual Product that predicts potential enterprise cyber threats. Beyond cybersecurity, we also develop the Firebird framework to help municipal fire departments prioritize fire inspections.
Our work has made multiple important contributions to both theory and practice: our distributed boosting algorithm solved an open problem of distributed learning; ShapeShifter motivated a new DARPA program (GARD); Virtual Product led to two patents; and Firebird was highlighted by the National Fire Protection Association as a best practice for using data to inform fire inspections.
  • Item
    Optimizing resource allocation in computational sustainability: Models, algorithms and tools
    (Georgia Institute of Technology, 2021-01-21) Gupta, Amrita ; Dilkina, Bistra ; Chau, Duen Horng ; Catalyurek, Umit ; Fuller, Angela ; Morris, Dan ; Computational Science and Engineering
    The 17 Sustainable Development Goals laid out by the United Nations include numerous targets as well as indicators of progress towards sustainable development. Decision-makers tasked with meeting these targets must frequently propose upfront plans or policies made up of many discrete actions, such as choosing a subset of locations where management actions must be taken to maximize the utility of the actions. These types of resource allocation problems involve combinatorial choices and tradeoffs between multiple outcomes of interest, all in the context of complex, dynamic systems and environments. The computational requirements for solving these problems bring together elements of discrete optimization, large-scale spatiotemporal modeling and prediction, and stochastic models. This dissertation leverages network models as a flexible family of computational tools for building prediction and optimization models in three sustainability-related domain areas: 1) minimizing stochastic network cascades in the context of invasive species management; 2) maximizing deterministic demand-weighted pairwise reachability in the context of flood resilient road infrastructure planning; and 3) maximizing vertex-weighted and edge-weighted connectivity in wildlife reserve design. We use spatially explicit network models to capture the underlying system dynamics of interest in each setting, and contribute discrete optimization problem formulations for maximizing sustainability objectives with finite resources. While there is a long history of research on optimizing flows, cascades and connectivity in networks, these decision problems in the emerging field of computational sustainability involve novel objectives, new combinatorial structure, or new types of intervention actions. In particular, we formulate a new type of discrete intervention in stochastic network cascades modeled with multivariate Hawkes processes. 
In conjunction, we derive an exact optimization approach for the proposed intervention based on closed-form expressions of the objective functions, which is applicable in a broad swath of domains beyond invasive species, such as social networks and disease contagion. We also formulate a new variant of Steiner Forest network design, called the budget-constrained prize-collecting Steiner forest, and prove that this optimization problem possesses a specific combinatorial structure, restricted supermodularity, that allows us to design highly effective algorithms. In each of the domains, the optimization problem is defined over aspects that need to be predicted, hence we also demonstrate improved machine learning approaches for each.
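The abstract above models network cascades with multivariate Hawkes processes. As a rough illustration of what such a model computes, the sketch below evaluates the conditional intensity of each node given past events. The exponential kernel, the function name `hawkes_intensity`, and all parameter values are assumptions made for this example; the dissertation's own formulation and intervention model are not reproduced here.

```python
import math


def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of every node at time t (illustrative sketch).

    events[j] : list of past event times observed on node j
    mu[i]     : baseline event rate of node i
    alpha[i][j]: how strongly an event on node j excites node i
    beta      : decay rate of the exponential kernel (an assumed kernel choice)
    """
    intensities = []
    for i in range(len(mu)):
        rate = mu[i]
        for j, times in enumerate(events):
            for s in times:
                if s < t:  # only events strictly in the past contribute
                    rate += alpha[i][j] * math.exp(-beta * (t - s))
        intensities.append(rate)
    return intensities


# Two nodes; a single event on node 0 at time 1.0 excites node 1.
mu = [0.1, 0.2]
alpha = [[0.0, 0.5], [0.3, 0.0]]
events = [[1.0], []]
lam = hawkes_intensity(2.0, events, mu, alpha, beta=1.0)
```

Here node 0 stays at its baseline rate because it does not excite itself, while node 1's rate is raised by the decayed influence of the event on node 0.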
  • Item
    Long read mapping at scale: Algorithms and applications
    (Georgia Institute of Technology, 2019-04-01) Jain, Chirag ; Aluru, Srinivas ; Konstantinidis, Konstantinos T. ; Catalyurek, Umit ; Phillippy, Adam M. ; Jordan, King ; Computational Science and Engineering
    The capability to sequence DNA has been around for four decades now, providing ample time to explore its myriad applications and the concomitant development of bioinformatics methods to support them. Nevertheless, disruptive technological changes in sequencing often upend prevailing protocols and characteristics of what can be sequenced, necessitating a new direction of development for bioinformatics algorithms and software. We are now at the cusp of the next revolution in sequencing due to the development of long and ultra-long read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Long reads are attractive because they narrow the scale gap between sizes of genomes and sizes of sequenced reads, with the promise of avoiding assembly errors and repeat resolution challenges that plague short read assemblers. However, long reads themselves sport error rates in the vicinity of 10-15%, compared to the high accuracy of short reads (< 1%). There is an urgent need to develop bioinformatics methods to fully realize the potential of long-read sequencers. Mapping and alignment of reads to a reference is typically the first step in genomics applications. Though long read technologies are still evolving, research efforts in bioinformatics have already produced many alignment-based and alignment-free read mapping algorithms. Yet, much work lies ahead in designing provably efficient algorithms, formally characterizing the quality of results, and developing methods that scale to larger input datasets and growing reference databases. While the current model of representing the reference as a collection of linear genomes is still favored due to its simplicity, mapping to graph-based representations, where the graph encodes genetic variations in a human population, also becomes imperative. This dissertation work is focused on provably good and scalable algorithms for mapping long reads to both linear and graph references.
We make the following contributions: 1. We develop fast and approximate algorithms for end-to-end and split mapping of long reads to reference genomes. Our work is the first to demonstrate scaling to the entire NCBI database, the collection of all curated and non-redundant genomes. 2. We generalize the mapping algorithm to accelerate the related problem of computing pairwise whole-genome comparisons. We shed light on two fundamental biological questions concerning genomic duplications and delineating microbial species boundaries. 3. We provide new complexity results for aligning reads to graphs under Hamming and edit distance models to classify the problem variants for which the existence of a polynomial-time solution is unlikely. In contrast to prior results that assume alphabets as a function of the problem size, we prove that the problem variants that allow edits in the graph remain NP-complete for even constant-sized alphabets, thereby resolving the computational complexity of the problem for DNA and protein sequence to graph alignments. 4. Finally, we propose a new parallel algorithm to optimally align long reads to large variation graphs derived from human genomes. It demonstrates near-linear scaling on multi-core CPUs, resulting in run-time reduction from multiple days to three hours when aligning a long read set to an MHC human variation graph.
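The complexity results above are stated for Hamming and edit distance models. For readers unfamiliar with the two measures, the following is a minimal sketch of both on plain strings. It covers only the sequence-to-sequence case, not the sequence-to-graph alignment studied in the dissertation, and the function names are chosen for this example.

```python
def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(read, reference):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn `read` into `reference` (classic dynamic program)."""
    # previous[j] holds the distance between read[:i-1] and reference[:j].
    previous = list(range(len(reference) + 1))
    for i, r in enumerate(read, start=1):
        current = [i]  # turning read[:i] into the empty string costs i deletions
        for j, c in enumerate(reference, start=1):
            substitution = previous[j - 1] + (r != c)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


one_deletion = edit_distance("ACGT", "AGT")
one_mismatch = hamming_distance("ACGT", "ACCT")
```

The dynamic program keeps only two rows of the table, so memory grows with the reference length rather than with the product of both lengths.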
  • Item
    Parallel algorithms for direct blood flow simulations
    (Georgia Institute of Technology, 2012-02-21) Rahimian, Abtin ; Biros, George ; Alben, Silas ; Fernandez-Nieves, Alberto ; Hu, David ; Vuduc, Richard ; Computational Science and Engineering
    The fluid mechanics of blood can be well approximated by a mixture model of a Newtonian fluid and deformable particles representing the red blood cells. Experimental and theoretical evidence suggests that the deformation and rheology of red blood cells are similar to those of phospholipid vesicles. Vesicles and red blood cells are both area-preserving closed membranes that resist bending. Beyond red blood cells, vesicles can be used to investigate the behavior of cell membranes, intracellular organelles, and viral particles. Given the importance of vesicle flows, in this thesis we focus on efficient numerical methods for such problems: we present computationally scalable algorithms for the simulation of dilute suspensions of deformable vesicles in two and three dimensions. Our method is based on the boundary integral formulation of Stokes flow. We present new schemes for simulating the three-dimensional hydrodynamic interactions of large numbers of vesicles with viscosity contrast. The algorithms incorporate a stable time-stepping scheme, high-order spatiotemporal discretizations, spectral preconditioners, and a reparametrization scheme capable of resolving extreme mesh distortions in dynamic simulations. The associated linear systems are solved in optimal time using spectral preconditioners. The highlights of our numerical scheme are that (i) the physics of vesicles is faithfully represented by using nonlinear solid mechanics to capture the deformations of each cell, (ii) the long-range, N-body, hydrodynamic interactions between vesicles are accurately resolved using the fast multipole method (FMM), and (iii) our time-stepping scheme is unconditionally stable for the flow of single and multiple vesicles with viscosity contrast and its computational cost-per-simulation-unit-time is comparable to or less than that of an explicit scheme. We report scaling of our algorithms to simulations with millions of vesicles on thousands of computational cores.
  • Item
    Calculation, utilization, and inference of spatial statistics in practical spatio-temporal data
    (Georgia Institute of Technology, 2017-08-02) Cecen, Ahmet ; Kalidindi, Surya R. ; Song, Le ; Garmestani, Hamid ; Chau, Duen Horng ; Kang, Sung H. ; Computational Science and Engineering
    The direct influence of spatial and structural arrangement at various length scales on the performance characteristics of materials is a core premise of materials science. Spatial correlations in the form of n-point statistics have been shown to be very effective in robustly describing the structural features of a plethora of materials systems, with a high number of cases where the obtained features were successfully used to establish highly accurate and precise relationships to performance measures and manufacturing parameters. This work addresses issues in calculation, representation, inference and utilization of spatial statistics under practical considerations for the materials researcher. Modifications are presented to the theory and algorithms of the existing convolution-based computation framework in order to accommodate deformed, irregular, rotated, missing or degenerate data, with complex or non-probabilistic state definitions. Memory-efficient, personal-computer-oriented implementations are discussed for the extended framework. A universal microstructure generation framework with the ability to efficiently address a vast variety of geometric or statistical constraints, including those imposed by spatial statistics, is assembled while maintaining scalability and compatibility with structure generators in the literature.
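The n-point statistics mentioned above can be made concrete with the simplest case, a 2-point autocorrelation. The sketch below computes it by direct summation on a one-dimensional, periodic, two-phase microstructure. The periodic boundary, the 1-D setting, and the name `two_point_autocorrelation` are assumptions for illustration; the convolution-based framework and its extensions to irregular or missing data are not reproduced.

```python
def two_point_autocorrelation(microstructure):
    """2-point autocorrelation of one phase on a periodic 1-D grid.

    microstructure[s] is 1 where the phase of interest occupies cell s, else 0.
    Entry r of the result is the probability that two cells separated by a
    shift of r are both occupied by that phase.
    """
    n = len(microstructure)
    return [
        sum(microstructure[s] * microstructure[(s + r) % n] for s in range(n)) / n
        for r in range(n)
    ]


# Alternating phases: cells two apart always match, neighbours never do.
stats = two_point_autocorrelation([1, 0, 1, 0])
```

The zero-shift entry equals the volume fraction of the phase. The direct sum costs quadratic time in the number of cells, which is the cost that FFT-based convolution approaches are typically used to avoid.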
  • Item
    Towards Performance Portable Graph Algorithms
    (Georgia Institute of Technology, 2021-12-14) Yasar, Abdurrahman ; Çatalyürek, Ümit V. ; Vuduc, Richard ; Zhang, Xiuwei ; Sadayappan, Ponnuswamy ; Rajamanickam, Sivasankaran ; Computational Science and Engineering
    In today's data-driven world, our computational resources have become heterogeneous, making the processing of large-scale graphs in an architecture-agnostic manner crucial. Traditionally, hand-optimized high-performance computing (HPC) solutions have been studied and used to implement highly efficient and scalable graph algorithms. In recent years, several graph processing and management systems have also been proposed. Hand-optimized HPC approaches require high levels of expertise, and graph processing frameworks suffer from limited expressibility and performance. Portability is a major concern for both approaches. The main thesis of this work is that block-based graph algorithms offer a compromise between efficient parallelism and architecture-agnostic algorithm design for a wide class of graph problems. This dissertation seeks to prove this thesis by focusing the work on three pillars: data/computation partitioning, block-based algorithm design, and performance portability. In this dissertation, we first show how we can partition the computation and the data to design efficient block-based algorithms for solving graph merging and triangle counting problems. Then, generalizing from our experiences, we propose PGAbB, an algorithmic framework for implementing block-based graph algorithms on shared-memory, heterogeneous machines. PGAbB aims to maximally leverage different architectures by implementing a task-based execution on top of a block-based programming model. In this talk we will discuss PGAbB's programming model, algorithmic optimizations for scheduling, and load-balancing strategies for graph problems on real-world and synthetic inputs.
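To give a feel for block-based triangle counting, the sketch below groups the oriented edges of a graph into 2-D blocks and treats each block as an independent counting task. It is a sequential toy, not PGAbB's programming model or API: the modulo-based block assignment and the name `count_triangles_blocked` are assumptions made for this example.

```python
from collections import defaultdict


def count_triangles_blocked(edges, num_blocks):
    """Count triangles by processing the edge list one 2-D block at a time."""
    # Orient every edge from its lower to its higher endpoint so that each
    # triangle u < v < w is found exactly once, from its edge (u, v).
    higher = defaultdict(set)
    for u, v in edges:
        if u != v:
            a, b = (u, v) if u < v else (v, u)
            higher[a].add(b)

    # Assign each oriented edge to a block keyed by the blocks of its endpoints.
    # (Vertex-to-block mapping by modulo is an illustrative assumption.)
    blocks = defaultdict(list)
    for u, neighbours in higher.items():
        for v in neighbours:
            blocks[(u % num_blocks, v % num_blocks)].append((u, v))

    # Each block is an independent task; a parallel runtime could schedule
    # these tasks on different devices. Here they simply run in a loop.
    per_block = {
        key: sum(len(higher[u] & higher.get(v, set())) for u, v in block_edges)
        for key, block_edges in blocks.items()
    }
    return sum(per_block.values()), per_block


# The complete graph on four vertices contains four triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
total, per_block = count_triangles_blocked(k4, num_blocks=2)
```

Because the per-block counts are independent, the total is just their sum, which is the property that makes block tasks convenient units for scheduling and load balancing.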
  • Item
    Techniques to improve genome assembly quality
    (Georgia Institute of Technology, 2019-03-28) Nihalani, Rahul ; Aluru, Srinivas ; Vuduc, Richard ; Jordan, King ; Wang, May Dongmei ; Catalyurek, Umit V. ; Computational Science and Engineering
    De-novo genome assembly is an important problem in the field of genomics. Discovering and analysing genomes of different species has numerous applications. For humans, it can lead to early detection of disease traits and timely prevention of diseases like cancer. In addition, it is useful in discovering genomes of unknown species. Even though it has received enormous attention in the last couple of decades, the problem remains unsolved to a satisfactory level, as shown in various scientific studies. Paired-end sequencing is a technology that sequences pairs of short strands from a genome, called reads. The pairs of reads originate from nearby genomic locations, and are commonly used to help more accurately determine the genomic location of individual reads and resolve repeats in genome assembly. In this thesis, we describe the genome assembly problem and the key challenges involved in solving it. We discuss related work, describing the two most popular models for approaching the problem, de-Bruijn graphs and overlap graphs, along with their pros and cons. We describe our proposed techniques to improve the quality of genome assembly. Our main contribution in this work is designing a de-Bruijn graph based assembly algorithm to effectively utilize paired reads to improve genome assembly quality. We also discuss how our algorithm tackles some of the key challenges involved in genome assembly. We adapt this algorithm to design a parallel strategy to obtain high quality assembly for large datasets such as rice within a reasonable time frame. In addition, we describe our work on probabilistically estimating overlap graphs for large short-read datasets. We discuss the results obtained for our work, and conclude with some future work.
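For readers new to the de-Bruijn graph model named above, the following is a minimal sketch of how such a graph is built from reads: every k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. It ignores paired-end information, sequencing errors, and reverse complements, all of which a real assembler must handle, and the function name is chosen for this example.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


# Two overlapping reads; the shared k-mer "CGT" yields a repeated edge CG -> GT.
graph = build_de_bruijn_graph(["ACGT", "CGTA"], k=3)
```

Repeated edges are kept on purpose in this sketch, since edge multiplicity is one common signal of coverage when walking the graph to reconstruct contigs.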
  • Item
    Parallel Algorithms and Generalized Frameworks for Learning Large-Scale Bayesian Networks
    (Georgia Institute of Technology, 2021-08-13) Srivastava, Ankit ; Aluru, Srinivas ; Catalyurek, Umit V ; Davenport, Mark A ; Dovrolis, Constantine ; Vuduc, Richard W ; Computational Science and Engineering
    Bayesian networks (BNs) are an important subclass of probabilistic graphical models that employ directed acyclic graphs to compactly represent exponential-sized joint probability distributions over a set of random variables. Since BNs enable probabilistic reasoning about interactions between the variables of interest, they have been successfully applied in a wide range of applications in the fields of medical diagnosis, gene networks, cybersecurity, epidemiology, etc. Furthermore, the recent focus on the need for explainability in human-impact decisions made by machine learning (ML) models has led to a push for replacing the prevalent black-box models with inherently interpretable models like BNs for making high-stakes decisions in hitherto unexplored areas. Learning the exact structure of BNs from observational data is an NP-hard problem and therefore a wide range of heuristic algorithms have been developed for this purpose. However, even the heuristic algorithms are computationally intensive. The existing software packages for BN structure learning with implementations of multiple algorithms are either completely sequential or support limited parallelism and can take days to learn BNs with even a few thousand variables. Previous parallelization efforts have focused on one or two algorithms for specific applications and have not resulted in broadly applicable parallel software. This has prevented BNs from becoming a viable alternative to other ML models. In this dissertation, we develop efficient parallel versions of a variety of BN learning algorithms from two categories: six different constraint-based methods and a score-based method for constructing a specialization of BNs known as module networks. We also propose optimizations for the implementations of these parallel algorithms to achieve maximum performance in practice. Our proposed algorithms are scalable to thousands of cores and outperform the previous state-of-the-art by a large margin. 
We have made the implementations available as open-source software packages that can be used by ML and application-domain researchers for expeditious learning of large-scale BNs.
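The compact representation described above rests on one identity: the joint distribution factorizes into one conditional probability per variable given its parents in the directed acyclic graph. The sketch below shows that factorization on a toy three-variable network. The network structure, variable names, and probability values are invented for illustration and do not come from the dissertation.

```python
from itertools import product

# Parents of each variable in a toy DAG (listed in topological order).
parents = {
    "rain": [],
    "sprinkler": ["rain"],
    "wet": ["rain", "sprinkler"],
}

# cpt[var][parent_values] = P(var is True | parents take parent_values).
# All numbers are made up for this example.
cpt = {
    "rain": {(): 0.2},
    "sprinkler": {(True,): 0.01, (False,): 0.4},
    "wet": {
        (True, True): 0.99,
        (True, False): 0.8,
        (False, True): 0.9,
        (False, False): 0.0,
    },
}


def joint_probability(assignment):
    """P(assignment) as the product of each variable's conditional probability."""
    probability = 1.0
    for variable, variable_parents in parents.items():
        key = tuple(assignment[p] for p in variable_parents)
        p_true = cpt[variable][key]
        probability *= p_true if assignment[variable] else 1.0 - p_true
    return probability


# Summing over all 2^3 assignments must give 1 for a valid distribution.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
```

The three tables hold 1 + 2 + 4 = 7 numbers, the same as a full joint table over three binary variables would need, but the gap widens quickly as networks grow while each variable keeps only a few parents.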
  • Item
    Human-centered AI through scalable visual data analytics
    (Georgia Institute of Technology, 2019-11-01) Kahng, Minsuk Brian ; Chau, Duen Horng ; Navathe, Shamkant ; Endert, Alex ; Wattenberg, Martin ; Viégas, Fernanda ; Computational Science and Engineering
    While artificial intelligence (AI) has led to major breakthroughs in many domains, understanding machine learning models remains a fundamental challenge. How can we make AI more accessible and interpretable, or more broadly, human-centered, so that people can easily understand and effectively use these complex models? My dissertation addresses these fundamental and practical challenges in AI through a human-centered approach, by creating novel data visualization tools that are scalable, interactive, and easy to learn and to use. With such tools, users can better understand models by visually exploring how large input datasets affect the models and their results. Specifically, my dissertation focuses on three interrelated parts: (1) Unified scalable interpretation: developing scalable visual analytics tools that help engineers interpret industry-scale deep learning models at both instance- and subset-level (e.g., ActiVis deployed by Facebook); (2) Data-driven model auditing: designing visual data exploration tools that support discovery of insights through exploration of data groups over different analytics stages, such as model comparison (e.g., MLCube) and fairness auditing (e.g., FairVis); and (3) Learning complex models by experimentation: building interactive tools that broaden people's access to learning complex deep learning models (e.g., GAN Lab) and browsing raw datasets (e.g., ETable). My research has had a significant impact on society and industry. The ActiVis system for interpreting deep learning models has been deployed on Facebook's machine learning platform. The GAN Lab tool for learning GANs has been open-sourced in collaboration with Google, with its demo used by more than 70,000 people from over 160 countries.
  • Item
    Efficient inference algorithms for network activities
    (Georgia Institute of Technology, 2015-01-08) Tran, Long Quoc ; Chau, Duen Horng ; Zha, Hongyuan ; Song, Le ; Sun, Jimeng ; Zhou, Haomin ; Gray, Alexander G. ; Computational Science and Engineering
    The real social network and associated communities are often hidden under the declared friend or group lists in social networks. We usually observe the manifestation of these hidden networks and communities in the form of recurrent and time-stamped individuals' activities in the social network. The inference of relationships between users/nodes or groups of users/nodes could be further complicated when activities are interval-censored, that is, when one only observes the number of activities that occurred in certain time windows. The same phenomenon happens in the online advertisement world, where the advertisers often offer a set of advertisement impressions and observe a set of conversions (i.e. product/service adoption). In this case, the advertisers desire to know which advertisements best appeal to the customers and, most importantly, their rate of conversions. Inspired by these challenges, we investigated inference algorithms that efficiently recover user relationships in both cases: time-stamped data and interval-censored data. In the case of time-stamped data, we proposed a novel algorithm called NetCodec, which relies on a Hawkes process that models the intertwined relationship between group participation and between-user influence. Using the Bayesian variational principle and optimization techniques, NetCodec could infer both group participation and user influence simultaneously with iteration complexity being O((N+I)G), where N is the number of events, I is the number of users, and G is the number of groups. In the case of interval-censored data, we proposed a Monte-Carlo EM inference algorithm where we iteratively impute the time-stamped events using a Poisson process whose intensity function approximates the underlying intensity function. We show that the proposed simulated approach delivers better inference performance than baseline methods.
In the advertisement problem, we propose a Click-to-Conversion delay model that uses Hawkes processes to model the advertisement impressions and thinned Poisson processes to model the Click-to-Conversion mechanism. We then derive an efficient Maximum Likelihood Estimator which utilizes the Minorization-Maximization framework. We verify the model against real-life online advertisement logs in comparison with recent conversion rate estimation methods. To facilitate reproducible research, we also developed an open-source software package that focuses on various Hawkes processes proposed in the above-mentioned works and prior works. We provided efficient parallel (multi-core) implementations of the inference algorithms using the Bayesian variational inference framework. To further speed up these inference algorithms, we also explored distributed optimization techniques for convex optimization under the distributed data situation. We formulate this problem as a consensus-constrained optimization problem and solve it with the alternating direction method of multipliers (ADMM). It turns out that using a bipartite graph as the communication topology exhibits the fastest convergence.
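The consensus-constrained formulation mentioned above can be illustrated on the smallest possible case. In the sketch below, each worker holds one number and a local quadratic loss, and ADMM drives all workers to agree on a single value, which for this loss is the mean of the local numbers. This is the textbook global-consensus form with a central averaging step, not the bipartite communication topology the abstract reports; the loss, the function name, and the parameter defaults are assumptions for illustration.

```python
def consensus_admm(local_targets, rho=1.0, iterations=100):
    """Global-consensus ADMM for minimizing sum_i 0.5 * (x - a_i)^2.

    Worker i keeps a local copy x[i] and a scaled dual variable u[i];
    z is the shared consensus value all copies are pulled toward.
    """
    n = len(local_targets)
    x = [0.0] * n
    u = [0.0] * n
    z = 0.0
    for _ in range(iterations):
        # Local step: each worker minimizes its own loss plus a penalty
        # for disagreeing with the consensus (closed form for this loss).
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(local_targets, u)]
        # Consensus step: average the local copies plus their duals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate each worker's remaining disagreement.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z


# Three workers holding 1, 2, and 6 should agree on their mean.
estimate = consensus_admm([1.0, 2.0, 6.0])
```

Only the consensus step needs communication, which is why the choice of communication topology matters for convergence speed in the distributed setting.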