Organizational Unit:
School of Computer Science

Description
School established in 2007

Publication Search Results

Now showing 1 - 10 of 794
  • Item
    Language-Driven Robotics: Dynamic Closed-Loop Control for Complex Manipulation
    (Georgia Institute of Technology, 2024-12-16) Barroso, Pierre
    This thesis presents a novel framework for integrating language-driven control with dynamic closed-loop manipulation, enabling robotic systems to adapt effectively to complex and changing environments. By leveraging advanced 3D scene representations, vision-language models, and innovative methodologies, the research addresses the challenges of grasping and manipulating novel objects in real-time. The proposed approach combines Gaussian Splatting, a fast and efficient 3D representation technique, with CLIP embeddings to provide semantic understanding of the scene. Using Grounded-SAM2 for segmentation and Gaussian-based clustering for dynamic object detection, the system achieves high segmentation accuracy. Object tracking is handled with Co-Tracker 3, ensuring robust updates to object positions and transformations in dynamic scenes. These capabilities culminate in a grasp pose generation mechanism, allowing reliable execution of grasps on unseen objects. Experimental results demonstrate the framework's effectiveness in rapid scene reconstruction, accurate segmentation, robust tracking, and high success rates in grasp execution. Despite successes, challenges such as training instabilities, hardware dependencies, and tracking drift highlight opportunities for further improvement. This work advances language-driven robotics by integrating semantic understanding with dynamic manipulation. It lays the foundation for adaptive and intelligent robotic systems, with future directions including enhanced encoder/decoder models, improved dynamic scene representations, and expanded applications in fast-paced and complex tasks. The contributions open new avenues for real-world robotic applications requiring adaptability and precision.
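    A minimal sketch of the semantic lookup step described above: per-Gaussian CLIP features are compared against the CLIP embedding of a language query to pick out the Gaussians belonging to the requested object. The feature arrays, threshold, and function names below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def select_target_gaussians(gaussian_feats, text_feat, threshold=0.3):
    """Score each 3D Gaussian against a language query by cosine similarity.

    gaussian_feats : (N, D) array of per-Gaussian CLIP features (assumed precomputed)
    text_feat      : (D,) CLIP embedding of the instruction, e.g. "pick up the red mug"
    Returns the indices of Gaussians whose similarity exceeds the threshold.
    """
    g = gaussian_feats / np.linalg.norm(gaussian_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = g @ t                                   # cosine similarity per Gaussian
    return np.nonzero(sims > threshold)[0], sims

# Toy usage with random stand-in features; real features would come from CLIP.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 512))
query = rng.normal(size=512)
matched, _ = select_target_gaussians(feats, query, threshold=0.1)
print(f"{len(matched)} Gaussians matched the query")
```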
  • Item
    ChatHF: Collecting Rich Human Feedback from Real-time Conversations
    (Georgia Institute of Technology, 2024-12-16) Li, Andrew Larry
    We introduce ChatHF, an interactive annotation framework for chatbot evaluation that integrates configurable annotation within a chat interface. ChatHF can be flexibly configured to accommodate various chatbot evaluation tasks, for example detecting offensive content, identifying incorrect or misleading information in chatbot responses, and flagging responses that might compromise privacy. It supports post-editing of chatbot outputs and visual inputs, in addition to an optional voice interface. ChatHF is suitable for the collection and annotation of NLP datasets and for Human-Computer Interaction studies, as demonstrated in case studies on image geolocation and assisting older adults with daily activities.
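    The abstract notes that ChatHF can be flexibly configured for different evaluation tasks. The sketch below shows a hypothetical configuration for one such task; every field name is invented for illustration and is not taken from ChatHF's actual API.

```python
# Hypothetical configuration for a ChatHF-style annotation session.
# Field names are illustrative; they are not taken from the ChatHF codebase.
chat_eval_config = {
    "task": "offensive_content_detection",
    "annotation_schema": {
        "labels": ["offensive", "misleading", "privacy_risk", "ok"],
        "allow_post_editing": True,       # annotators may rewrite the chatbot reply
        "free_text_rationale": True,
    },
    "interface": {
        "visual_inputs": True,            # show images alongside the conversation
        "voice_interface": False,         # optional speech I/O, off in this example
    },
    "export": {"format": "jsonl", "include_conversation_history": True},
}

def validate(config):
    """Minimal sanity check before launching an annotation session."""
    assert config["annotation_schema"]["labels"], "at least one label is required"
    return config

validate(chat_eval_config)
```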
  • Item
    Empowering Guardians of the Digital Realm: An Analysis of the Current State of Trust & Safety and Opportunities for Advancing the Industry
    (Georgia Institute of Technology, 2024-12-11) Swenson, Michael Ray
    This work analyzes the challenges faced by Trust & Safety professionals in managing online content moderation and transparency practices. Through 16 semi-structured interviews and participant observation, the authors examined how these professionals navigate complex policy areas, such as harassment, hate speech, misinformation, and legal requests. The study reveals that Trust & Safety workers encounter significant obstacles in moderating non-English content, addressing the needs of children and teens, and adapting to increasing governmental regulations worldwide. Participants emphasized the need for stronger knowledge-sharing programs, open-source tools, and cross-platform collaborations to better tackle online harm. Additionally, participants advocate for enhanced transparency reporting and algorithmic accountability to increase public trust. The study concludes by suggesting that Trust & Safety professionals should play a more active role in shaping regulations that govern online platforms. This work offers both theoretical insights into industry challenges and practical recommendations for advancing the Trust & Safety field through collaboration and knowledge sharing.
  • Item
    Capability-Aware Shared Hypernetworks for Heterogeneous Multi-Agent Coordination
    (Georgia Institute of Technology, 2024-12-09) Fu, Kevin
    Cooperative heterogeneous multi-agent tasks require agents to behave in a flexible and complementary manner that best leverages their diverse capabilities. Learning-based approaches to this challenge span a spectrum between two endpoints: i) shared-parameter methods, which assign an ID to each agent to encode diverse behaviors within a single architecture for sample efficiency, but are limited in their ability to learn diverse behaviors; ii) independent methods, which learn a separate policy for each agent, enabling greater diversity at the cost of sample and parameter efficiency. Prior work on learning for heterogeneous multi-agent teams has already explored the middle ground of this spectrum by learning shared-parameter or independent policies for classes of agents, allowing for a compromise between diversity and efficiency. However, these approaches still do not reason over the impact of agent capabilities on behavior, and thus cannot generalize to unseen agents or team compositions. In this work, we aim to enable flexible and heterogeneous coordination without sacrificing diversity, sample efficiency, or generalization to unseen agents and teams. First, inspired by work on trait-based heterogeneous task allocation, we explore how capability awareness enables generalization to unseen agents and teams. We thoroughly evaluate our GNN-based capability-aware policy architecture, showing that it generalizes more effectively than existing approaches. Then, inspired by recent work in transfer learning and meta-RL, we propose Capability-Aware Shared Hypernetworks (CASH), a new soft weight-sharing architecture for heterogeneous coordination that uses hypernetworks to explicitly reason about continuous agent capabilities in addition to local observations. Intuitively, CASH allows the team to learn shared decision-making strategies (captured by a shared encoder) that are readily adapted according to the team’s individual and collective capabilities (by a shared hypernetwork). Our design is agnostic to the underlying learning paradigm. We conducted detailed experiments across two heterogeneous coordination tasks and three standard learning paradigms (imitation learning, value-based reinforcement learning, and policy-gradient reinforcement learning). Results reveal that CASH generates appropriately diverse behaviors that consistently outperform baseline architectures in terms of task performance and sample efficiency during both training and zero-shot generalization. Notably, CASH provides these improvements with only 20% to 40% of the learnable parameters used by the baselines.
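    A minimal PyTorch sketch of the soft weight-sharing idea described above, assuming a shared observation encoder and a shared hypernetwork that maps an agent's continuous capability vector to the weights of a small action head. Layer sizes, names, and the form of the action head are assumptions for illustration, not the CASH implementation.

```python
import torch
import torch.nn as nn

class CapabilityHyperPolicy(nn.Module):
    """Sketch of capability-conditioned soft weight sharing (CASH-style).

    A shared encoder processes local observations; a shared hypernetwork
    maps the agent's capability vector to the weights of a small linear
    head, so behavior adapts to capabilities without per-agent parameters.
    """
    def __init__(self, obs_dim, cap_dim, hidden=64, act_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # The hypernetwork outputs the flattened weights and bias of the head.
        self.hyper = nn.Sequential(
            nn.Linear(cap_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden * act_dim + act_dim),
        )
        self.hidden, self.act_dim = hidden, act_dim

    def forward(self, obs, caps):
        z = self.encoder(obs)                               # (B, hidden)
        params = self.hyper(caps)                           # (B, hidden*act + act)
        W = params[:, : self.hidden * self.act_dim].view(-1, self.act_dim, self.hidden)
        b = params[:, self.hidden * self.act_dim :]         # (B, act_dim)
        return torch.bmm(W, z.unsqueeze(-1)).squeeze(-1) + b

policy = CapabilityHyperPolicy(obs_dim=10, cap_dim=3)
logits = policy(torch.randn(5, 10), torch.randn(5, 3))      # 5 agents, 3 capabilities each
print(logits.shape)                                          # torch.Size([5, 4])
```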
  • Item
    Understanding Malware Analysts' Workflows to Narrow the Gap Between Research and Practice
    (Georgia Institute of Technology, 2024-12-08) Yong Wong, Miuyin M.
    Malicious software, or malware, presents a serious cybersecurity challenge, threatening individuals, organizations, and nation-states. To combat and prevent attacks launched with malware, it is essential to understand the malware’s intent and its impact on targeted systems. This process is usually referred to as malware analysis. Over the years, there have been significant research advances in automating the process of malware analysis. Despite these advances, human analysts still play an indispensable role in keeping defenses against malware current and effective. Unfortunately, the manual analysis process used by analysts in practice remains unexplored. To help address this gap, this thesis explores a human-centric approach to malware analysis. In this thesis, I begin by presenting the findings from a user study with malware analysts in practice. This study allowed us to define a taxonomy of malware analysts' objectives, identify five common analysis workflows, and highlight common challenges faced by these analysts. Next, I present the results of a comparative analysis that contrasts the findings from a systematic mapping of malware evasion countermeasures with insights gained from a user study on malware evasion. This comparison reveals several gaps between the real challenges faced by malware experts dealing with evasive malware and the focus of research solutions. Moreover, it highlights future research directions that can help analysts overcome challenging evasion techniques. Lastly, I demonstrate the potential of Large Language Models (LLMs), used with a human-in-the-loop approach, to help analysts overcome some of the identified challenges that arise from evasion tactics. Malware analysis remains a serious challenge despite decades of research and tool development. It is hoped that the insights offered by this thesis will help researchers develop tools and techniques that reduce analyst burden and enable more timely defenses against malware.
  • Item
    Deep Reinforcement Learning Framework for Autonomous Surface Vehicles in Environmental Cleanup
    (Georgia Institute of Technology, 2024-12-08) Ro, Junghwan
    Water pollution from floating plastics poses significant environmental threats that require efficient solutions. Autonomous surface vehicles (ASVs) present a promising means of addressing this challenge. However, deploying deep reinforcement learning (DRL) for ASV control in real-world environmental missions remains underexplored due to simulation limitations and the sim-to-real gap. This thesis presents a DRL framework for ASVs focused on environmental missions, explicitly targeting the autonomous collection of floating waste. An open-source, highly parallelized hydrodynamics and buoyancy simulation environment is developed to facilitate large-scale training. By integrating system identification with domain randomization, we reduce the sim-to-real gap, enhancing the robustness and energy efficiency of the trained agents. The proposed approach is validated through simulation and real-world experiments, demonstrating improved task completion times and reduced energy consumption. Task experiments show that our approach reduces energy consumption by 13.1% while reducing task completion time by 7.4%. These findings, together with our open-source implementation, have the potential to improve the efficiency and versatility of ASVs, contributing to environmental preservation efforts. This thesis incorporates and expands on work previously published in a paper presented at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) in 2024. Significant portions of that content have been reused and adapted to fit the comprehensive format and depth required for the thesis.
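    A minimal sketch of the domain-randomization step mentioned above: hydrodynamic and buoyancy parameters are resampled around nominal values (which system identification would supply) at the start of each training episode. Parameter names, units, and ranges are illustrative assumptions, not values from the thesis.

```python
import numpy as np

# Nominal dynamics parameters, e.g. obtained from system identification.
# Names and values are illustrative, not taken from the thesis implementation.
NOMINAL = {
    "linear_drag": 6.0,       # N*s/m
    "quadratic_drag": 4.5,    # N*s^2/m^2
    "added_mass": 12.0,       # kg
    "buoyancy_offset": 0.0,   # m, vertical offset of the center of buoyancy
}

def randomize_dynamics(rng, spread=0.2):
    """Sample per-episode dynamics parameters within +/- spread of their nominal values."""
    sampled = {}
    for name, value in NOMINAL.items():
        low, high = sorted((value * (1 - spread), value * (1 + spread)))
        sampled[name] = value if low == high else rng.uniform(low, high)
    return sampled

rng = np.random.default_rng(42)
for episode in range(3):
    params = randomize_dynamics(rng)
    # reset_simulation(params)  # a simulator would consume these values here
    print(episode, {k: round(v, 2) for k, v in params.items()})
```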
  • Item
    Human-centered Explainable AI
    (Georgia Institute of Technology, 2024-12-08) Ehsan, Upol
    If AI systems are going to inform consequential decisions such as deciding whether you should get a loan or receive an organ transplant, they must be explainable to everyone, not just software engineers. Despite commendable technical progress in “opening” the black-box of AI, the prevailing algorithm-centered Explainable AI (XAI) view overlooks a vital insight: who opens the black-box matters just as much as opening it. As a result of this blind spot, many popular XAI interventions have been ineffective and even harmful in real-world settings. To address the blind spot, this dissertation introduces and operationalizes Human- centered XAI (HCXAI), a human-centered and sociotechnically-informed XAI paradigm. Focusing on non-AI experts, this dissertation demonstrates how Human-centered XAI: • expands the design space of XAI by broadening the domain of non-algorithmic factors that augment AI explainability and illustrating how to incorporate them • enriches our knowledge of the importance of “who” the humans are in XAI design • enables resourceful ways to do Responsible AI by providing proactive mitigation strategies through participatory methods It contributes 1) conceptually: new concepts such as such as Social Transparency that showcase how to encode socio-organizational context to augment explainability without changing the internal model; 2) methodologically: human-centered evaluation of XAI, actionable frameworks, and participatory methods to co-design XAI systems; 3) technically: computational techniques and design artifacts; 4) empirically: findings such as how one’s AI background impacts one’s interpretation of AI explanations, user perceptions of real AI users, and how AI explanations can negatively impact users despite our best intentions. The impact of this dissertation spans research, practice, and policy. Beyond pioneering the HCXAI research domain, it has influenced society– informed AI policies at interna- tional organizations like the UN and being incorporated into NIST’s AI Risk Management Framework, a global standard for Responsible AI. The work been adopted by industry– seven Fortune 500 companies adopted its techniques, positively impacting over 3 million users by addressing AI trust calibration and resulting in savings of US $4.2 million. It has also nurtured a vibrant research community–over 400 researchers from 19+ countries have participated in four HCXAI workshops at ACM CHI (the leading venue for Human- Computer Interaction research) since 2021, culminating in the first ACM HCXAI journal issue, where I led the editorial efforts. The dissertation transforms the XAI discourse from an algorithm-centered perspective to a human-centered one. It takes a foundational step towards creating a future where anyone, regardless of their background, can interact with AI systems in an explainable, accountable, and dignified manner so that people who are not at the table do not end up on the menu.
  • Item
    Traffic Sign Localization Using SfM and Deep Learning
    (Georgia Institute of Technology, 2024-12-08) Ho, Hoang Nhu
    This study addresses the challenge of traffic sign inventory management faced by the U.S. Department of Transportation in complying with Manual on Uniform Traffic Control Devices (MUTCD) standards. The study proposes a cost-effective methodology for geo-localizing traffic signs using smartphone-recorded video and GPS data. The approach employs several techniques, including deep learning models for depth estimation and Structure-from-Motion (SfM), to accurately determine the geographic coordinates of roadside traffic signs. The methodology was tested in diverse environments, including challenging mountain roads with curves and urban settings. SfM is shown to be the most effective approach, demonstrating high accuracy: 90.91% of tested signs (40 out of 44) in Pima County, Arizona, and 88.24% of tested signs (90 out of 102) on Peyton Road, Atlanta, Georgia, achieved a distance error below 4.9 meters. The remaining discrepancies were caused mainly by GPS inaccuracies rather than limitations of the method itself. These results establish SfM as a promising solution for efficient and accurate traffic sign geo-localization.
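    A minimal sketch of the geometric core of SfM-based sign localization: the sign's position is recovered as the least-squares point closest to viewing rays cast from camera poses estimated along the vehicle track, after which local coordinates would be mapped to GPS. The camera poses and ray directions below are made up for illustration.

```python
import numpy as np

def triangulate_point(origins, directions):
    """Least-squares point closest to a set of 3D viewing rays.

    origins    : (K, 3) camera centers estimated by SfM
    directions : (K, 3) bearing vectors toward the detected sign
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, np.asarray(directions, float)):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projects onto the plane orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b)

# Two illustrative camera centers a few meters apart along the road,
# both looking toward a sign placed at roughly (10, 3, 2) meters.
cams = np.array([[0.0, 0.0, 1.5], [4.0, 0.0, 1.5]])
sign_true = np.array([10.0, 3.0, 2.0])
rays = sign_true - cams
print(triangulate_point(cams, rays))   # approximately [10, 3, 2]
```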
  • Item
    Navigation to Objects: The Significance of Scene Realism and Moving towards Universal Navigation
    (Georgia Institute of Technology, 2024-12-08) Khanna, Mukul
    Recent years have brought considerable progress in embodied AI agents that navigate in realistic scenes, follow language instructions, find and rearrange objects, and perform other tasks involving embodied sensing, planning, and acting. This progress is bolstered by simulation platforms that enable systematic, safe, and scalable training and evaluation of embodied AI agents before deployment to the real world. However, despite the ubiquitous use of synthetic 3D scene datasets in embodied AI experiments, there has been no systematic analysis of the tradeoffs between dataset scale (number of scenes and total scene physical size) and dataset realism (visual fidelity and correlation to real-world statistics). Furthermore, the community has primarily been focused on episodic navigation: agents navigate to a single goal object in each episode, specified through a single input modality (e.g., an object category label, a natural language description, or an image). In this thesis, we focus on tackling these two shortcomings in prior work by 1) contributing a new dataset of high-quality, human-authored synthetic 3D scenes and a systematic analysis of scene dataset scale and realism towards improved ObjectNav agent generalization, and 2) building novel universal navigation systems capable of handling various goal types, enabling more effective user interaction with robots. In Chapter 2, we contribute the Habitat Synthetic Scenes Dataset (HSSD-200), a dataset of 211 high-quality realistic 3D scenes and 18,656 models of real-world objects, and use it to test navigation agent generalization to realistic 3D environments. Specifically, we investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find and navigate to objects (ObjectGoal navigation). We find that scale helps in generalization, but the benefits quickly saturate, making visual fidelity and correlation to real-world scenes more important. In Chapter 3 and Chapter 4, we move beyond single-goal episodic evaluation setups focusing on only one goal specification modality and explore universal navigation agents that are multi-modal and lifelong. Specifically, we introduce the GO to Any Thing (GOAT) task, a state-of-the-art modular system for universal navigation in the real world, and GOAT-Bench, a benchmark providing a comprehensive analysis of modular and end-to-end trained methods with and without memory representations.
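    A minimal sketch of the goal-matching step a universal navigation agent needs: goals given as a category label or a language description (or, in a full system, an image) are encoded into a shared embedding space and compared against embeddings of object instances in the agent's memory. The encoder here is a stub and all names are illustrative; a real system would use vision-language models.

```python
import numpy as np

EMB_DIM = 256

def encode_goal(goal_text):
    """Stub goal encoder: hash the goal to a deterministic pseudo-random vector.

    A real universal-navigation system would use a vision-language model here
    and would also accept image goals; this stub only illustrates the interface.
    """
    seed = abs(hash(goal_text)) % (2**32)
    return np.random.default_rng(seed).normal(size=EMB_DIM)

def best_matching_instance(goal_text, instance_embeddings):
    """Return the index of the stored object instance most similar to the goal."""
    g = encode_goal(goal_text)
    g = g / np.linalg.norm(g)
    M = instance_embeddings / np.linalg.norm(instance_embeddings, axis=1, keepdims=True)
    return int(np.argmax(M @ g))

# Toy memory of 50 previously observed object instances (random stand-ins).
memory = np.random.default_rng(7).normal(size=(50, EMB_DIM))
for goal in ["potted plant", "the blue chair next to the window"]:
    print(goal, "->", best_matching_instance(goal, memory))
```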
  • Item
    Witness Functions in Program Analysis and Complexity Theory
    (Georgia Institute of Technology, 2024-12-08) Ding, Shuo
    Proving impossibility results is one of the main themes of program analysis theory and computability/complexity theory. For example, we can prove a program analysis problem is undecidable, meaning that there does not exist an algorithm to precisely solve the problem. As another example, we can prove a problem does not belong to a complexity class, meaning that every correct algorithm for the problem must exceed the given resource restriction. In general, given a class C of computational problems and a specific computational problem P not in C, a witness function maps every candidate Q in C to an input on which P and Q are different. We investigate the computational properties of such witness functions and discuss their implications. In program analysis theory, we prove that a large class of undecidable program analysis problems have computable witness functions, including every semantic property described in Rice's theorem. This implies the existence of computable functions mapping every program analyzer to a more precise program analyzer. Through two real program analysis tasks (1) CFL-reachability based program analysis for Java and LLVM-IR and (2) template constraint analysis for C++, we demonstrate that computable witness functions provide guarantees on the progress of developing more and more precise program analysis techniques. In complexity theory, we prove that witness functions for major complexity classes are closely related to reductions, and discuss the implications in complexity class separation proofs.