Title:
Zero-shot object-goal navigation using multimodal goal embeddings
Author(s)
Aggarwal, Gunjan
Advisor(s)
Batra, Dhruv
Abstract
My thesis presents a scalable approach for learning open-world object-goal navigation
(ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object
in an unexplored environment (e.g., “find a sink”). The approach is entirely zero-shot –
i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we
train on the image-goal navigation (ImageNav) task, in which agents find the location
where a picture (i.e., goal image) was captured. Specifically, we encode goal images
into a multimodal, semantic embedding space to enable training semantic-goal navigation
(SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After
training, SemanticNav agents can be instructed to find objects described in free-form
natural language (e.g., “sink,” “bathroom sink,” etc.) by projecting language goals into
the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets
(Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2%–20.0% over existing zero-shot methods. For reference, these gains are similar to or larger than
the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge
winners. In an open-world setting, we discover that our agents generalize to compound instructions that mention a room explicitly (e.g., "Find a kitchen sink") and to instructions from which the target room can be inferred (e.g., "Find a sink and a stove").
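The abstract describes the core mechanism only at a high level: goal images are encoded into a multimodal, semantic embedding space during ImageNav training, and free-form language goals are projected into the same space at evaluation time. The abstract does not name the encoder, so the sketch below is a minimal illustration assuming a CLIP-style vision-language model accessed through the Hugging Face transformers API; it is not the thesis's actual implementation.

# Minimal sketch (assumed, not from the thesis): embed image goals and
# language goals into one shared space with a CLIP-style encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image_goal(image):
    # Training time: encode an ImageNav goal image.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_language_goal(text):
    # Evaluation time: encode a free-form object description.
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

# Both embeddings live in the same space, so a SemanticNav policy trained
# on image-goal embeddings can be conditioned on a text goal zero-shot.
goal_img = embed_image_goal(Image.new("RGB", (224, 224)))
goal_txt = embed_language_goal("a kitchen sink")
print(torch.cosine_similarity(goal_img, goal_txt))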
Date Issued
2023-05-01
Resource Type
Text
Resource Subtype
Thesis