Organizational Unit: School of Computational Science and Engineering


Publication Search Results

Now showing 1 - 10 of 43
  • Item
    A framework for automated management of exploit testing environments
    (Georgia Institute of Technology, 2015-12-07) Flansburg, Kevin
    To demonstrate working exploits or vulnerabilities, people often share their findings in the form of a proof-of-concept (PoC) prototype. Such practices are particularly useful for learning about real vulnerabilities and state-of-the-art exploitation techniques. Unfortunately, the shared PoC exploits are seldom reproducible, in part because they are often not thoroughly tested, but largely because authors lack a formal way to specify the tested environment or its dependencies. Although exploit writers attempt to overcome such problems by describing their dependencies or testing environments using comments, this informal way of sharing PoC exploits makes it hard for exploit authors to achieve the original goal of demonstration. More seriously, these non- or hard-to-reproduce PoC exploits have limited potential to be utilized for other useful research purposes such as penetration testing, or in benchmark suites to evaluate defense mechanisms. In this paper, we present XShop, a framework and infrastructure to describe environments and dependencies for exploits in a formal way, and to automatically resolve these constraints and construct an isolated environment for development, testing, and sharing with the community. We show how XShop's flexible design enables new possibilities for utilizing these reproducible exploits in five practical use cases: as a security benchmark suite, in pen-testing, for large-scale vulnerability analysis, as a shared development environment, and for regression testing. We design and implement such applications by extending the XShop framework and demonstrate its effectiveness with twelve real exploits against well-known bugs that include GHOST, Shellshock, and Heartbleed. We believe that the proposed practice not only brings immediate incentives to exploit authors but also has the potential to grow into a community-wide knowledge base.
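    To make the idea of a formal environment specification concrete, the sketch below shows a hypothetical, simplified spec and dependency check in Python. The field names, versions, and the POC_SPEC structure are illustrative assumptions, not XShop's actual syntax or API; XShop additionally resolves the constraints and builds an isolated test environment, which this toy does not attempt.

      # Hypothetical sketch of a declarative exploit-environment spec (not XShop's
      # real format): pin the software versions the PoC was tested against so a
      # tool could later rebuild an equivalent isolated environment.
      POC_SPEC = {
          "name": "heartbleed-poc",
          "base_image": "ubuntu:12.04",
          "target": {"package": "openssl", "version": "1.0.1f"},
          "depends": [{"package": "python", "version": "2.7.3"}],
      }

      def unmet_constraints(spec, installed):
          """Return package names whose installed version differs from the pin."""
          pins = [spec["target"]] + spec["depends"]
          return [d["package"] for d in pins
                  if installed.get(d["package"]) != d["version"]]

      # Toy check against a simulated host inventory: openssl is patched, python missing.
      print(unmet_constraints(POC_SPEC, {"openssl": "1.0.1g"}))  # ['openssl', 'python']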
  • Item
    Unsupervised learning of disease subtypes from continuous time Hidden Markov Models of disease progression
    (Georgia Institute of Technology, 2015-08-21) Gupta, Amrita
    The detection of subtypes of complex diseases has important implications for diagnosis and treatment. Numerous prior studies have used data-driven approaches to identify clusters of similar patients, but it is not yet clear how to best specify what constitutes a clinically meaningful phenotype. This study explored disease subtyping on the basis of temporal development patterns. In particular, we attempted to differentiate infants with autism spectrum disorder into more fine-grained classes with distinctive patterns of early skill development. We modeled the progression of autism explicitly using a continuous-time hidden Markov model. Subsequently, we compared subjects on the basis of their trajectories through the model state space. Two approaches to subtyping were utilized, one based on time-series clustering with a custom distance function and one based on tensor factorization. A web application was also developed to facilitate the visual exploration of our results. Results suggested the presence of 3 developmental subgroups in the ASD outcome group. The two subtyping approaches are contrasted and possible future directions for research are discussed.
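    As a rough illustration of the continuous-time machinery such a model relies on, the sketch below computes state-occupancy probabilities over an arbitrary time gap from a generator (rate) matrix, the core operation a continuous-time hidden Markov model needs for irregularly spaced observations. The 3-state generator and the time gap are made-up values, not the states or rates estimated in the thesis.

      import numpy as np
      from scipy.linalg import expm

      # Illustrative 3-state generator matrix Q (each row sums to zero);
      # the thesis's CT-HMM has its own states and an emission model on top.
      Q = np.array([[-0.30,  0.25,  0.05],
                    [ 0.00, -0.10,  0.10],
                    [ 0.00,  0.00,  0.00]])   # last state absorbing

      def transition_matrix(Q, t):
          """P(t) = expm(Q * t): transition probabilities over a gap of length t."""
          return expm(Q * t)

      # Distribution over states 6 time units after starting in state 0.
      p0 = np.array([1.0, 0.0, 0.0])
      print(p0 @ transition_matrix(Q, 6.0))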
  • Item
    Single-tree GMM training
    (Georgia Institute of Technology, 2015-05-27) Curtin, Ryan R.
  • Item
    Influence modeling in behavioral data
    (Georgia Institute of Technology, 2015-05-15) Li, Liangda
    Understanding influence in behavioral data has become increasingly important in analyzing the cause and effect of human behaviors under various scenarios. Influence modeling enables us to learn not only how human behaviors drive the diffusion of memes spread in different kinds of networks, but also how chain reactions evolve in the sequential behaviors of people. In this thesis, I propose to investigate appropriate probabilistic models for efficiently and effectively modeling influence, along with applications and extensions of the proposed models to analyze behavioral data in computational sustainability and information search. One fundamental problem in influence modeling is learning the degree of influence between individuals, which we call social infectivity. In the first part of this work, we study how to efficiently and effectively learn social infectivity in diffusion phenomena in social networks and other applications. We replace the pairwise infectivity in multidimensional Hawkes processes with linear combinations of time-varying features, and optimize the associated coefficients with lasso regularization. In the second part of this work, we investigate the modeling of influence between marked events in the application of energy consumption, which tracks the diffusion of mixed daily routines of household members. Specifically, we leverage temporal and energy consumption information recorded by smart meters in households for influence modeling, through a novel probabilistic model that combines marked point processes with topic models. The learned influence is expected to reveal the sequential appliance usage patterns of household members, and thereby helps address the problem of energy disaggregation. In the third part of this work, we investigate a complex influence modeling scenario that requires simultaneous learning of both infectivity and influence existence. Specifically, we study the modeling of influence in search behaviors, where the influence tracks the diffusion of mixed search intents of search engine users in information search. We leverage temporal and textual information in query logs for influence modeling, through a novel probabilistic model that combines point processes with topic models. The learned influence is expected to link queries that serve the same information need, and thereby helps address the problem of search task identification. Modeling influence with the Markov property also helps us understand the chain reactions in the interactions of search engine users with a query auto-completion (QAC) engine within each query session. The fourth part of this work studies how a user's present interaction with a QAC engine influences his/her interaction in the next step. We propose a novel probabilistic model based on Markov processes, which leverages such influence in the prediction of users' click choices among the suggested queries of QAC engines, and accordingly improves the suggestions to better satisfy users' search intents. In the fifth part of this work, we study the mutual influence between users' behaviors on query auto-completion (QAC) logs and normal click logs across different query sessions. We propose a probabilistic model to explore the correlation between users' behavior patterns on QAC and click logs, and expect to capture the mutual influence between users' behaviors in QAC and click sessions.
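    The first contribution can be summarized with a small amount of notation: in a multidimensional Hawkes process, the intensity of user u depends on a baseline rate and on past events through pairwise infectivities; the thesis replaces each infectivity with a linear combination of time-varying features and fits the coefficients under a lasso penalty. The display below uses generic symbols (mu, kappa, f_k, beta) rather than the thesis's own notation:

      \lambda_u(t) = \mu_u + \sum_{t_i < t} a_{u, u_i}\, \kappa(t - t_i),
      \qquad a_{u,v}(t) = \sum_{k} \beta_k f_k(u, v, t),
      \qquad \hat{\beta} = \arg\min_{\beta} \; -\log L(\beta) + \lambda \lVert \beta \rVert_1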
  • Item
    Algorithmic techniques for the Micron Automata Processor
    (Georgia Institute of Technology, 2015-05-15) Roy, Indranil
    Our research is the first in-depth study of the use of the Micron Automata Processor, a novel re-configurable streaming co-processor which is purpose-built to execute thousands of Non-deterministic Finite Automata (NFA) in parallel. By design, this processor is well-suited to accelerate applications which need to find all occurrences of thousands of complex string patterns in the input data. We have validated this by implementing two such applications, one from network security and the other from bioinformatics, both of which are significantly faster than their state-of-the-art counterparts. Our research has also widened the scope of the applications which can be accelerated by this processor by finding ways to quickly program any generic graph into it and then search for hard-to-find features such as maximal cliques and Hamiltonian paths. These applications and algorithms have yielded valuable design inputs for the next generation of the chip, which is currently in the design phase. We hope that this work paves the way to the early adoption of this upcoming architecture and to efficient solutions of some currently computationally challenging problems.
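    A purely software toy can illustrate the style of workload involved: matching many patterns simultaneously against a streaming input, which the Automata Processor executes natively as thousands of NFAs in parallel. The code below is an illustrative serial sketch, not the processor's programming interface or the dissertation's algorithms.

      def stream_match(patterns, text):
          """Report (end_index, pattern) for every occurrence of every literal pattern."""
          active = []                          # partial matches: (pattern_id, chars_matched)
          hits = []
          for i, ch in enumerate(text):
              nxt = []
              # advance existing partial matches and start new ones at this symbol
              for pid, k in active + [(p, 0) for p in range(len(patterns))]:
                  if patterns[pid][k] == ch:
                      if k + 1 == len(patterns[pid]):
                          hits.append((i, patterns[pid]))
                      else:
                          nxt.append((pid, k + 1))
              active = nxt
          return hits

      print(stream_match(["GHOST", "SHOCK"], "SHELLSHOCK GHOST"))  # [(9, 'SHOCK'), (15, 'GHOST')]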
  • Item
    Method and software for predicting emergency department disposition in pediatric asthma
    (Georgia Institute of Technology, 2015-04-21) Kumar, Vikas
    An important application of predictive data mining in clinical medicine is predicting the disposition of patients being seen in the emergency department (ED); such prediction could lead to increased efficiency of our healthcare system. A number of tools have emerged in recent years that use machine learning methods to predict whether patients will be admitted or discharged; however, such models are often limited in that they rely on specialized knowledge, are not optimal, use predictors that are unavailable early in the patient visit, and require memorization of clinical rules and scoring systems. The goal of this study is to develop an effective and practical clinical tool for identifying asthma patients who will be admitted to the hospital. In contrast to existing tools, the model of this study relies on routine information collected early during the patient visit. While most tools specific to asthma are developed using only a few hundred patients, this study uses the records of 9,000+ children seen across two major metropolitan emergency departments for asthma exacerbations. An unprecedented set of 70 variables is assessed for predictive strength and early availability; a novel sequence of methods including lasso-regularized logistic regression and a modified "best subset" approach is then used to select the final 4-variable model. A web application is then developed that calculates an admission probability score from the patient parameters at the point of care. The methods and results of this study will be useful both for those aiming to develop similar tools and for ED providers caring for asthma patients.
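    A minimal sketch of the modeling step, under stated assumptions: the synthetic data and feature columns below are stand-ins, not the thesis's cohort or its final four predictors; only the technique (lasso-regularized logistic regression followed by a point-of-care probability) is illustrated.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # Synthetic stand-in data: 10 hypothetical early-visit predictors.
      rng = np.random.default_rng(0)
      X = rng.normal(size=(500, 10))
      y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(size=500) > 0).astype(int)

      # Lasso-regularized (L1) logistic regression; smaller C gives a sparser model,
      # which is how a handful of variables can be selected from a larger pool.
      model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
      print("selected variables:", np.flatnonzero(model.coef_[0]))

      # Point-of-care style use: admission probability for one new patient.
      print("admission probability:", model.predict_proba(X[:1])[0, 1])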
  • Item
    Efficient inference algorithms for network activities
    (Georgia Institute of Technology, 2015-01-08) Tran, Long Quoc
    The real social network and associated communities are often hidden under the declared friend or group lists in social networks. We usually observe the manifestation of these hidden networks and communities in the form of recurrent, time-stamped individual activities in the social network. The inference of relationships between users/nodes or groups of users/nodes can be further complicated when activities are interval-censored, that is, when one only observes the number of activities that occurred in certain time windows. The same phenomenon happens in the online advertisement world, where advertisers often offer a set of advertisement impressions and observe a set of conversions (i.e., product/service adoptions). In this case, the advertisers desire to know which advertisements best appeal to the customers and, most importantly, their rate of conversion. Inspired by these challenges, we investigated inference algorithms that efficiently recover user relationships in both cases: time-stamped data and interval-censored data. In the case of time-stamped data, we proposed a novel algorithm called NetCodec, which relies on a Hawkes process that models the intertwined relationship between group participation and between-user influence. Using the Bayesian variational principle and optimization techniques, NetCodec can infer both group participation and user influence simultaneously with iteration complexity O((N+I)G), where N is the number of events, I is the number of users, and G is the number of groups. In the case of interval-censored data, we proposed a Monte-Carlo EM inference algorithm in which we iteratively impute the time-stamped events using a Poisson process whose intensity function approximates the underlying intensity function. We show that the proposed simulation approach delivers better inference performance than baseline methods. In the advertisement problem, we propose a Click-to-Conversion delay model that uses Hawkes processes to model the advertisement impressions and thinned Poisson processes to model the Click-to-Conversion mechanism. We then derive an efficient Maximum Likelihood Estimator which utilizes the Minorization-Maximization framework. We verify the model against real-life online advertisement logs in comparison with recent conversion rate estimation methods. To facilitate reproducible research, we also developed an open-source software package that focuses on the various Hawkes processes proposed in the above-mentioned works and prior works. We provide efficient parallel (multi-core) implementations of the inference algorithms using the Bayesian variational inference framework. To further speed up these inference algorithms, we also explored distributed optimization techniques for convex optimization under distributed data. We formulate this problem as a consensus-constrained optimization problem and solve it with the alternating direction method of multipliers (ADMM). It turns out that using a bipartite graph as the communication topology exhibits the fastest convergence.
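    For readers unfamiliar with the underlying machinery, the sketch below simulates a univariate self-exciting Hawkes process with an exponential kernel via Ogata's thinning algorithm. It is only meant to show the kind of point process the models above build on; it is not NetCodec, and the parameter values are arbitrary.

      import math, random

      def simulate_hawkes(mu, alpha, beta, T, seed=0):
          """Ogata thinning for a Hawkes process with intensity
          lambda(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i))."""
          random.seed(seed)
          events, t = [], 0.0
          while t < T:
              lam_bar = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
              t += random.expovariate(lam_bar)           # candidate next event time
              if t >= T:
                  break
              lam_t = mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in events)
              if random.random() <= lam_t / lam_bar:     # accept with prob lambda(t)/lam_bar
                  events.append(t)
          return events

      print(len(simulate_hawkes(mu=0.5, alpha=0.8, beta=1.2, T=100.0)))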
  • Item
    High-performance algorithms and software for large-scale molecular simulation
    (Georgia Institute of Technology, 2014-12-17) Liu, Xing
    Molecular simulation is an indispensable tool in many different disciplines such as physics, biology, chemical engineering, materials science, drug design, and others. Performing large-scale molecular simulation is of great interest to biologists and chemists because many important biological and pharmaceutical phenomena can only be observed in very large molecular systems and over sufficiently long time dynamics. On the other hand, molecular simulation methods usually have very steep computational costs, which limits current molecular simulation studies to relatively small systems. The gap between the scale of molecular simulation that existing techniques can handle and the scale of interest has become a major barrier to applying molecular simulation to real-world problems. Studying large-scale molecular systems with molecular simulation requires developing highly parallel simulation algorithms and constantly adapting them to rapidly changing high-performance computing architectures. However, many existing algorithms and codes for molecular simulation date from more than a decade ago and were designed for sequential computers or early parallel architectures; they may not scale efficiently and do not fully exploit features of today's hardware. Given the rapid evolution in computer architectures, the time has come to revisit these molecular simulation algorithms and codes. In this thesis, we demonstrate our approach to addressing the computational challenges of large-scale molecular simulation by presenting both high-performance algorithms and software for two important molecular simulation applications, Hartree-Fock (HF) calculations and hydrodynamics simulations, on highly parallel computer architectures. The algorithms and software presented in this thesis have been used by biologists and chemists to study problems that could not be solved using existing codes. The parallel techniques and methods developed in this work can also be applied to other molecular simulation applications.
  • Item
    Parallel algorithms for generalized N-body problem in high dimensions and their applications for Bayesian inference and image analysis
    (Georgia Institute of Technology, 2014-11-11) Xiao, Bo
    In this dissertation, we explore parallel algorithms for general N-body problems in high dimensions, and their applications in machine learning and image analysis on distributed infrastructures. In the first part of this work, we propose and develop a set of basic tools built on top of the Message Passing Interface and OpenMP for massively parallel nearest neighbor search. In particular, we present a distributed tree structure to index data in an arbitrary number of dimensions, and a novel algorithm that eliminates the need for collective coordinate exchanges during tree construction. To the best of our knowledge, our nearest neighbors package is the first that scales to millions of cores in up to a thousand dimensions. Building on our nearest neighbor search algorithms, we present "ASKIT", a parallel fast kernel summation tree code with a new near-far field decomposition and a new compact representation for the far field. Notably, our algorithm is kernel independent: the efficiency of the new near-far decomposition depends only on the intrinsic dimensionality of the data, and the new far-field representation relies only on the rank of sub-blocks of the kernel matrix. In the second part, we develop a Bayesian inference framework and a variational formulation for MAP estimation of the label field for medical image segmentation. In particular, we propose new representations for both the likelihood and prior probability functions, as well as their fast calculation, and give a parallel matrix-free optimization algorithm to solve the MAP estimation. Our new prior function is suitable for many spatial inverse problems. Experimental results show that our framework is robust to noise, shape variations, and artifacts.
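    As a point of reference for the nearest-neighbor component, the toy below runs a serial tree-indexed k-nearest-neighbor query with SciPy; the dissertation's contribution, a distributed MPI/OpenMP tree and the ASKIT kernel summation, is far beyond this sketch, and the data here is synthetic.

      import numpy as np
      from scipy.spatial import cKDTree

      # Synthetic data: 10,000 points in 16 dimensions (serial, single node).
      rng = np.random.default_rng(1)
      points = rng.normal(size=(10000, 16))
      tree = cKDTree(points)                    # build the spatial index once

      queries = rng.normal(size=(5, 16))
      dists, idx = tree.query(queries, k=8)     # 8 nearest neighbors per query point
      print(idx.shape)                          # (5, 8)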
  • Item
    Virtual time-aware virtual machine systems
    (Georgia Institute of Technology, 2014-07-01) Yoginath, Srikanth B.
    Discrete dynamic system models that track, maintain, utilize, and evolve virtual time are referred to as virtual time systems (VTS). The realization of VTS using virtual machine (VM) technology offers several benefits, including fidelity, scalability, interoperability, fault tolerance, and load balancing. The usage of VTS with VMs appears in two ways: (a) VMs within VTS, and (b) VTS over VMs. The former is prevalent in high-fidelity cyber infrastructure simulations and cyber-physical system simulations, wherein VMs form a crucial component of VTS. The latter appears in popular Cloud computing services, where VMs are offered as computing commodities and the VTS utilizes VMs as parallel execution platforms. Prior to the work presented here, the simulation community using VMs within VTS (specifically, cyber infrastructure simulations) had little awareness of the existence of a fundamental virtual time-ordering problem. The correctness problem was largely unnoticed and unaddressed because of the unrecognized effects of fair-share multiplexing of VMs when realizing the virtual time evolution of VMs within VTS. The dissertation research reported here demonstrated the latent incorrectness of existing methods, defined key correctness benchmarks, quantitatively measured the incorrectness, proposed and implemented novel algorithms to overcome the incorrectness, and optimized the solutions to execute without a performance penalty. In fact, our novel correctness-enforcing design yields better runtime performance than the traditional (incorrect) methods. Similarly, VTS execution over VM platforms such as Cloud computing services incurs large performance degradation, which was not known until our research uncovered the fundamental mismatch between the scheduling needs of VTS execution and those of traditional parallel workloads. Consequently, we designed a novel VTS-aware hypervisor scheduler and showed significant performance gains in VTS execution over VM platforms. Prior to our work, the performance concern of VTS over VMs was largely unaddressed due to the absence of an understanding of the execution-policy mismatch between VMs and VTS applications: VTS follows virtual-time-ordered execution, whereas conventional VM execution follows a fair-share policy. Our research quantitatively uncovered the exact cause of the poor performance of VTS on VM platforms. Moreover, we proposed and implemented a novel virtual time-aware execution methodology that relieves the degradation and provides over an order of magnitude faster execution than traditional virtual time-unaware execution.
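    The scheduling mismatch described above can be sketched in a few lines: a fair-share hypervisor scheduler gives every VM equal turns, while a virtual-time-aware scheduler always dispatches the VM that is furthest behind in virtual time. The toy below illustrates only the second policy with made-up VM names and step sizes; it is not the dissertation's hypervisor scheduler.

      import heapq

      # Each "VM" has a local virtual clock; step is how much virtual time
      # one dispatch advances it (made-up values for illustration).
      vms = {"vm_a": 0.0, "vm_b": 0.0, "vm_c": 0.0}
      step = {"vm_a": 5.0, "vm_b": 1.0, "vm_c": 2.0}

      def virtual_time_aware(dispatches):
          """Always run the VM with the lowest local virtual time (LVT order)."""
          heap = [(lvt, vm) for vm, lvt in vms.items()]
          heapq.heapify(heap)
          order = []
          for _ in range(dispatches):
              lvt, vm = heapq.heappop(heap)
              order.append(vm)
              heapq.heappush(heap, (lvt + step[vm], vm))
          return order

      # Slow-stepping VMs get more turns, so no VM races ahead in virtual time.
      print(virtual_time_aware(6))   # ['vm_a', 'vm_b', 'vm_c', 'vm_b', 'vm_b', 'vm_c']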