Organizational Unit:
School of Interactive Computing


Publication Search Results

Now showing 1 - 6 of 6
  • Item
    Towards multi-modal AI systems with open-world cognition
    (Georgia Institute of Technology, 2023-04-30) Agrawal, Harsh
    A long-term goal in AI research is to build intelligent systems with 'open-world' cognition. When deployed in the wild, AI systems should generalize to novel concepts and instructions. Such an agent would need to perceive both familiar and unfamiliar concepts present in its environment, combine the capabilities of models trained on different modalities, and incrementally acquire new skills to continuously adapt to the evolving world. In this thesis, we look at how we can combine complementary multi-modal knowledge with suitable forms of reasoning to enable novel concept learning. In Part 1, we show that agents can infer unfamiliar concepts in the presence of other familiar concepts by combining multi-modal knowledge with deductive reasoning. Furthermore, agents can use newly inferred concepts to update their vocabulary of known concepts and infer additional novel concepts incrementally. In Part 2, we show how task-dependent augmentations can improve robustness in unseen environments. In Part 3, we develop realistic tasks that require understanding novel concepts. We present a benchmark to evaluate an AI system's capability to describe novel objects present in an image. We also show how embodied agents can combine perception with common-sense knowledge to perform household chores like tidying up the house, without any explicit human instruction, even in the presence of unseen objects in unseen environments. Finally, in Part 4, we show that multi-modal knowledge stored in large pre-trained models can be used to teach agents new skills, allowing the agent to perform novel tasks of increasing difficulty in a zero-shot manner.
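As a flavor of the zero-shot use of pre-trained multi-modal models described in Part 4, the sketch below scores candidate concept labels against an image with an off-the-shelf vision-language model (CLIP, via Hugging Face transformers). It is illustrative only, not the thesis's system; the image file name and label set are hypothetical.

```python
# Minimal sketch: zero-shot concept scoring with a pre-trained
# vision-language model (CLIP). Illustrative only, not the thesis's system.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # hypothetical input image
candidates = ["a photo of a mug", "a photo of a theremin", "a photo of a cactus"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # distribution over candidate labels

for label, p in zip(candidates, probs[0].tolist()):
    print(f"{p:.3f}  {label}")
```

A novel-concept pipeline in this spirit would grow the candidate set as new concepts are inferred, rather than fixing it up front.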
  • Item
    Leveraging Value-awareness for Online and Offline Model-based Reinforcement Learning
    (Georgia Institute of Technology, 2022-12-07) Modhe, Nirbhay
    Model-based Reinforcement Learning (RL) lies at the intersection of planning and learning for sequential decision making. Value-awareness in model learning has recently emerged as a means to incorporate task or reward information into the model-learning objective, so that the model can exploit the specificity of a task. While theoretically superior to maximum likelihood estimation in the context of (online) model-based RL, value-awareness has remained impractical for most non-trivial tasks. This thesis aims to bridge the gap between theory and practice by applying the principle of value-awareness to two settings: the online RL setting and the offline RL setting. First, within online RL, this thesis revisits value-aware model learning from the perspective of minimizing the performance difference, obtaining a novel value-aware model learning objective as a direct upper bound of it. Then, this thesis investigates and remedies the issue of stale value estimates that has so far held back the practicality of value-aware model learning. Using the proposed remedy, performance improvements are demonstrated over maximum-likelihood baselines and existing value-aware objectives on several continuous control tasks, while also enabling existing value-aware objectives to become performant. In the offline RL context, this thesis takes a step back from model learning and applies value-awareness towards better data augmentation. Such data augmentation, when applied to model-based offline RL algorithms, allows for leveraging unseen states with low epistemic uncertainty that were previously not reachable within the assumptions and limitations of model-based offline RL. Value-aware state augmentations are found to enable better performance on offline RL benchmarks compared to existing baselines and non-value-aware alternatives.
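The core value-aware idea, fitting the dynamics model through the lens of the value function rather than by raw likelihood, can be sketched in a few lines. This is a minimal illustrative objective, not the thesis's exact bound; `dynamics_model` and `value_fn` are assumed callables.

```python
# Minimal sketch of a value-aware model-learning loss (illustrative, not
# the thesis's exact objective). Instead of maximizing likelihood of next
# states, the dynamics model is penalized for errors as seen *through the
# value function*: predictions that change the value estimate matter,
# value-irrelevant errors do not.
import torch

def value_aware_loss(dynamics_model, value_fn, s, a, s_next):
    """Squared error between values of true and predicted next states."""
    s_pred = dynamics_model(s, a)        # model's next-state prediction
    v_true = value_fn(s_next).detach()   # value targets treated as fixed
    v_pred = value_fn(s_pred)            # gradients flow into the model
    return ((v_pred - v_true) ** 2).mean()
```

In practice `value_fn`'s parameters would be frozen during the model update; the stale-value issue the thesis addresses corresponds to this target lagging behind the current policy's true values.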
  • Item
    Improved search techniques for structured prediction
    (Georgia Institute of Technology, 2020-07-29) Vijayakumar, Ashwin Kalyan
    Many useful AI tasks, such as machine translation, captioning, or program synthesis, can be abstracted as structured prediction problems. For these problems, the search space is well-defined but extremely large: all English-language sentences for captioning or translation and, similarly, all programs that can be generated from a context-free grammar in the case of program synthesis. Therefore, inferring the correct output (a sentence or a program) given the input (an image or user-defined specifications) is an intractable search problem. To overcome this, heuristics, either hand-designed or learnt from data, are often employed. In my work, I propose modified search procedures that output multiple diverse sequences and then, for the task of outputting programs, a novel search procedure that accelerates existing techniques via heuristics learnt from deep networks. Going further, I propose to study the role of memory in search, i.e., processing each new query with the memory of previous queries, specifically in the context of solving mathematical problems. In the context of sequence prediction tasks like image captioning or translation, I introduce Diverse Beam Search (DBS), an approximate inference technique to decode multiple relevant and diverse outputs. With the objective of producing multiple sentences that are different from each other, DBS modifies the commonly used beam search procedure by greedily imposing diversity constraints. In follow-up work, we directly formulate the task of modeling a set of sequences and propose a trainable search procedure dubbed diff-BS. While both algorithms are task-agnostic, image captioning is used as the test-bed to demonstrate their effectiveness. In the context of program synthesis, I propose Neural Guided Deductive Search (NGDS), which accelerates deductive search via learnt heuristics. We find that our approach results in a significant speedup without compromising the quality of the solutions found. Further, I discuss the application of this technique to programming by examples and to the synthesis of hard problems for a given solver. Finally, I study the interplay between memory and search, specifically in the context of mathematical problem solving. Analogical reasoning is a strategy commonly adopted by humans while solving problems, i.e., new and unseen problems are solved by drawing parallels to previously seen problems. Inspired by such an approach, I propose to learn suitable representations for “problems” that allow the reuse of solutions from previously seen problems as building blocks to construct the solution for the problem at hand.
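The DBS modification to beam search can be made concrete with a short sketch: the beam is split into groups that decode in sequence at each step, and later groups are penalized for reusing tokens already chosen by earlier groups. This is a simplified, illustrative rendering, not the paper's implementation; `step_logprobs` is a hypothetical model interface, and end-of-sequence handling and length normalization are omitted.

```python
# Simplified Diverse Beam Search (DBS) sketch with a Hamming-style
# diversity penalty between groups. Illustrative only.
from collections import Counter

def diverse_beam_search(step_logprobs, vocab, num_groups=3,
                        beam_per_group=2, max_len=10, diversity=0.5):
    # One beam list per group; each beam is (token_prefix, log_prob).
    groups = [[([], 0.0)] for _ in range(num_groups)]
    for _ in range(max_len):
        used = Counter()  # tokens picked by earlier groups at this step
        for g in range(num_groups):
            candidates = []
            for prefix, score in groups[g]:
                logps = step_logprobs(prefix)  # hypothetical: token -> log prob
                for tok in vocab:
                    base = score + logps[tok]
                    aug = base - diversity * used[tok]  # diversity-augmented score
                    candidates.append((prefix + [tok], base, aug))
            candidates.sort(key=lambda c: c[2], reverse=True)  # pick by augmented score
            groups[g] = [(p, s) for p, s, _ in candidates[:beam_per_group]]
            for prefix, _ in groups[g]:
                used[prefix[-1]] += 1  # penalize these tokens for later groups
    return [beam for group in groups for beam in group]
```

With `diversity=0`, all groups collapse back to independent runs of ordinary beam search, which is what makes the penalty's effect easy to ablate.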
  • Item
    Towards Transparent and Grounded Visual AI Systems
    (Georgia Institute of Technology, 2020-04-27) Goyal, Yash
    My research goal is to build transparent and grounded AI systems. More specifically, my research tries to answer the question -- Do deep visual models make their decisions for the "right reasons"? In my dissertation, I try to answer this question in two ways: 1. Visual grounding. Grounding is essential to building reliable and generalizable systems that are not driven by dataset biases. In the context of the task of Visual Question Answering (VQA), we would expect models to be visually grounded, i.e., looking at the right regions in the image while answering a question. I address this issue of visual grounding in VQA by proposing a) two new benchmarking datasets to test visual grounding, and b) a new VQA model that is visually grounded by design. 2. Transparency. Transparency in AI systems can help system designers find their failure modes and provide guidance for teaching humans. I developed techniques for generating explanations from deep models that give us insights into what they base their decisions on. Specifically, I study the following -- a) what parts of the inputs VQA models focus on while making a prediction, b) a new counter-example explanation modality in which a VQA model must identify images for which a given question-answer pair does not hold, c) counterfactual visual explanations and how we can use such explanations to teach humans, and d) causal concept explanations (explaining a “zebra” class prediction in terms of the human-understandable concept “stripes”) by reasoning about the causal relationships between concept explanations, images, and classifier predictions.
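As background for the "what parts of the inputs models focus on" question, here is a generic gradient-based saliency probe. It is a standard technique sketched under simple assumptions (a single-image classifier), not the dissertation's specific VQA attention or explanation methods.

```python
# Minimal sketch of gradient-based saliency: which input pixels most
# affect the predicted class score. A generic probe of "what the model
# looks at", not the dissertation's VQA-specific analyses.
import torch

def saliency_map(model, image):
    """image: (1, C, H, W) float tensor; returns an (H, W) importance map."""
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)
    top_class = logits[0].argmax().item()
    logits[0, top_class].backward()               # d(top score)/d(pixels)
    return image.grad.abs().max(dim=1).values[0]  # max over color channels
```

For VQA the same idea extends to both modalities: gradients (or attention weights) over image regions and over question words.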
  • Item
    Building agents that can see, talk, and act
    (Georgia Institute of Technology, 2020-04-25) Das, Abhishek
    A long-term goal in AI is to build general-purpose intelligent agents that simultaneously possess the ability to perceive the rich visual environment around us (through vision, audition, or other sensors), reason and infer from perception in an interpretable and actionable manner, communicate this understanding to humans and other agents (e.g., hold a natural language dialog grounded in the environment), and act on this understanding in physical worlds (e.g., aid humans by executing commands in an embodied environment). To be able to make progress towards this grand goal, we must explore new multimodal AI tasks, move from datasets to physical environments, and build new kinds of models. In this dissertation, we combine insights from different areas of AI -- computer vision, language understanding, reinforcement learning -- and present steps to connect the underlying domains of vision and language to actions towards such general-purpose agents. In Part 1, we develop agents that can see and talk -- capable of holding free-form conversations about images -- and reinforcement learning-based algorithms to train these visual dialog agents via self-play. In Part 2, we extend our focus to agents that can see, talk, and act -- embodied agents that can actively perceive and navigate in partially-observable simulated environments to accomplish tasks such as question-answering. In Part 3, we devise techniques for training populations of agents that can communicate with each other, to coordinate, strategize, and utilize their combined sensory experiences and act in the physical world. These agents learn both what messages to send and whom to communicate with, solely from downstream reward, without any communication supervision. Finally, in Part 4, we use question-answering as a task-agnostic probe to ask a self-supervised embodied agent what it knows about its physical world, and use it to quantify differences in the visual representations agents develop when trained with different auxiliary objectives.
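Part 3's idea of learning what to communicate purely from downstream reward can be illustrated with a toy sketch: a speaker samples a discrete message and is updated with REINFORCE on the task reward. Everything here (the `Speaker` module, dimensions, the stand-in reward) is illustrative rather than the dissertation's architecture.

```python
# Toy sketch of learning to communicate from reward alone: a speaker
# samples a discrete message; REINFORCE on the task reward shapes what
# gets sent, with no direct supervision on the messages themselves.
import torch
import torch.nn as nn

class Speaker(nn.Module):
    def __init__(self, obs_dim, vocab_size):
        super().__init__()
        self.net = nn.Linear(obs_dim, vocab_size)

    def forward(self, obs):
        dist = torch.distributions.Categorical(logits=self.net(obs))
        msg = dist.sample()                 # discrete message token
        return msg, dist.log_prob(msg)

speaker = Speaker(obs_dim=8, vocab_size=16)
opt = torch.optim.Adam(speaker.parameters(), lr=1e-3)

obs = torch.randn(32, 8)                    # batch of speaker observations
msg, logp = speaker(obs)
reward = (msg % 2 == 0).float()             # stand-in for a listener-derived task reward
loss = -(logp * (reward - reward.mean())).mean()  # REINFORCE with a mean baseline
opt.zero_grad()
loss.backward()
opt.step()
```

In a full system the reward would come from a listener agent acting on the message, so speaker and listener co-adapt their protocol.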
  • Item
    Disentangling neural network representations for improved generalization
    (Georgia Institute of Technology, 2020-04-24) Cogswell, Michael Andrew
    Despite the increasingly broad perceptual capabilities of neural networks, applying them to new tasks requires significant engineering effort in data collection and model design. Generally, inductive biases can make this process easier by leveraging knowledge about the world to guide neural network design. One such inductive bias is disentanglement, which can help prevent neural networks from learning representations that capture spurious patterns that do not generalize past the training data, and instead encourage them to capture factors of variation that explain the data generally. In this thesis we identify three kinds of disentanglement, implement a strategy for enforcing disentanglement in each case, and show that more general representations result. These perspectives treat disentanglement as statistical independence of features in image classification, language compositionality in goal-driven dialog, and latent intention priors in visual dialog. By increasing the generality of neural networks through disentanglement we hope to reduce the effort required to apply neural networks to new tasks and highlight the role of inductive biases like disentanglement in neural network design.
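One concrete way to encourage "statistical independence of features" is a decorrelation penalty on hidden activations, sketched below. This is a generic regularizer in the spirit of the thesis's first perspective; the exact form and weighting are assumptions.

```python
# Minimal sketch of a decorrelation penalty: discourage hidden features
# from covarying across a batch, one way to push a representation toward
# statistical independence of its features. Illustrative form.
import torch

def decorrelation_penalty(h):
    """h: (batch, features) activations; penalize off-diagonal covariance."""
    h = h - h.mean(dim=0, keepdim=True)           # center each feature
    cov = (h.T @ h) / h.shape[0]                  # batch covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))  # zero out the diagonal
    return 0.5 * (off_diag ** 2).sum()
```

Such a term would typically be added to the task loss with a small coefficient, so the network trades a little fit for features that vary independently.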