Title:
Zero-shot object-goal navigation using multimodal goal embeddings
Authors
Aggarwal, Gunjan
Advisors
Batra, Dhruv
Hoffmann, Judy
Parikh, Devi
Abstract
My thesis presents a scalable approach for learning open-world object-goal navigation
(ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object
in an unexplored environment (e.g., “find a sink”). The approach is entirely zero-shot –
i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we
train on the image-goal navigation (ImageNav) task, in which agents find the location
where a picture (i.e., goal image) was captured. Specifically, we encode goal images
into a multimodal, semantic embedding space to enable training semantic-goal navigation
(SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After
training, SemanticNav agents can be instructed to find objects described in free-form
natural language (e.g., “sink,” “bathroom sink,” etc.) by projecting language goals into
the same multimodal, semantic embedding space. As a result, our approach enables open-world
ObjectNav. We extensively evaluate our agents on three ObjectNav datasets
(Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% to
20.0% over existing zero-shot methods. For reference, these gains are similar to or better than
the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge
winners. In an open-world setting, we discover that our agents can generalize to compound
instructions with a room explicitly mentioned (e.g., “Find a kitchen sink”) and when the
target room can be inferred (e.g., “Find a sink and a stove”).
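
As a rough illustration of the goal-encoding step described in the abstract, the minimal Python sketch below assumes an off-the-shelf CLIP-style image/text encoder from the Hugging Face transformers library; the checkpoint name, goal image path, and text prompt are placeholder assumptions, and the navigation policy that consumes the goal embedding is omitted. It is a sketch of the shared-embedding idea, not the thesis code.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Assumed off-the-shelf multimodal encoder; checkpoint name is a placeholder.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Training (ImageNav): encode the goal image into the shared embedding space.
    goal_image = Image.open("goal_view.png")  # hypothetical goal-image path
    image_inputs = processor(images=goal_image, return_tensors="pt")
    with torch.no_grad():
        image_goal = model.get_image_features(**image_inputs)
    image_goal = torch.nn.functional.normalize(image_goal, dim=-1)

    # Inference (open-world ObjectNav): encode a free-form language goal into
    # the same space and hand it to the trained agent unchanged.
    text_inputs = processor(text=["a kitchen sink"], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_goal = model.get_text_features(**text_inputs)
    text_goal = torch.nn.functional.normalize(text_goal, dim=-1)

    # Both tensors live in the same embedding space (shape (1, 512) for this
    # checkpoint) and can serve as the agent's goal vector.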
Date Issued
2023-05-01
Resource Type
Text
Resource Subtype
Thesis