Large-Scale Offline Pre-Training Bootstraps Embodied Intelligence
Author(s)
Majumdar, Arjun
Abstract
A central goal in Artificial Intelligence (AI) is to develop embodied intelligence: agents such as mobile robots that can accomplish a wide variety of tasks in real-world, physical environments. In this dissertation, we argue that offline pre-training of foundation models on web-scale data can bootstrap embodied intelligence.
In part 1, we present VC-1, a visual foundation model pre-trained (primarily) on video data collected from an egocentric perspective. Using CortexBench, an embodied AI (EAI) benchmark curated from a diverse collection of existing EAI tasks spanning locomotion, navigation, and dexterous and mobile manipulation, we systematically demonstrate that such models benefit substantially from pre-training dataset diversity.
In part 2, we first demonstrate that visual grounding learned from internet data (i.e., image-caption pairs from the web) can be transferred to an instruction-following visual navigation agent (VLN-BERT). Then, we present ZSON, a highly scalable approach for learning to visually navigate to objects specified in open-vocabulary, natural language instructions such as “find the kitchen sink.”
In part 3, we study spatial understanding in real-world indoor environments. First, we introduce an evaluation benchmark (OpenEQA) to measure progress on answering open-ended questions about 3D scenes. Then, we present a modular agent that leverages pre-trained components such as vision-language models (VLMs) to address the question-answering task.
Date
2024-07-27
Resource Type
Text
Resource Subtype
Dissertation