Title: Zero-shot object-goal navigation using multimodal goal embeddings

dc.contributor.advisor Batra, Dhruv
dc.contributor.author Aggarwal, Gunjan
dc.contributor.committeeMember Hoffman, Judy
dc.contributor.committeeMember Parikh, Devi
dc.contributor.department Computer Science
dc.date.accessioned 2023-05-18T17:54:18Z
dc.date.available 2023-05-18T17:54:18Z
dc.date.created 2023-05
dc.date.issued 2023-05-01
dc.date.submitted May 2023
dc.date.updated 2023-05-18T17:54:18Z
dc.description.abstract My thesis presents a scalable approach for learning open-world object-goal navigation (ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., “find a sink”). The approach is entirely zero-shot – i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., a goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., “sink,” “bathroom sink,” etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% to 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we find that our agents generalize to compound instructions in which a room is explicitly mentioned (e.g., “Find a kitchen sink”) and in which the target room must be inferred (e.g., “Find a sink and a stove”).
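The abstract hinges on one idea: image goals and language goals are projected into a shared embedding space, so a policy trained only on image goals can later be conditioned on text. The record does not name the encoder; the following is a minimal sketch assuming a CLIP-style model (here, the Hugging Face `openai/clip-vit-base-patch32` checkpoint, chosen purely for illustration), with a hypothetical goal-image file standing in for an ImageNav goal.

```python
# Sketch of the multimodal goal-embedding idea, NOT the thesis's exact
# pipeline: a CLIP-style encoder is assumed as the shared embedding space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Training time (ImageNav): encode a goal image into the shared space.
goal_image = Image.open("goal_view.png")  # hypothetical goal snapshot
image_inputs = processor(images=goal_image, return_tensors="pt")
with torch.no_grad():
    image_goal = model.get_image_features(**image_inputs)
image_goal = image_goal / image_goal.norm(dim=-1, keepdim=True)

# Inference time (open-world ObjectNav): project a free-form language
# goal into the same space and hand it to the policy unchanged.
text_inputs = processor(
    text=["find a kitchen sink"], return_tensors="pt", padding=True
)
with torch.no_grad():
    text_goal = model.get_text_features(**text_inputs)
text_goal = text_goal / text_goal.norm(dim=-1, keepdim=True)

# Both goal vectors live in the same space with the same dimensionality,
# which is what lets an agent trained only on image goals follow text.
assert image_goal.shape == text_goal.shape
```

Because the navigation policy only ever sees a goal vector, swapping the image encoder's output for the text encoder's output at test time requires no retraining; this is what makes the approach zero-shot with respect to ObjectNav.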
dc.description.degree M.S.
dc.format.mimetype application/pdf
dc.identifier.uri https://hdl.handle.net/1853/72039
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Embodied AI
dc.subject multi-modal
dc.subject navigation
dc.subject zero-shot
dc.title Zero-shot object-goal navigation using multimodal goal embeddings
dc.type Text
dc.type.genre Thesis
dspace.entity.type Publication
local.contributor.advisor Batra, Dhruv
local.contributor.corporatename College of Computing
local.contributor.corporatename School of Computer Science
relation.isAdvisorOfPublication 2b8bc15b-448f-472b-8992-ca9862368cad
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
relation.isOrgUnitOfPublication 6b42174a-e0e1-40e3-a581-47bed0470a1e
thesis.degree.level Masters
Files
Original bundle
Name: AGGARWAL-THESIS-2023.pdf
Size: 5.85 MB
Format: Adobe Portable Document Format
License bundle
Name: LICENSE.txt
Size: 3.87 KB
Format: Plain Text