Title:
Zero-shot object-goal navigation using multimodal goal embeddings
dc.contributor.advisor | Batra, Dhruv | |
dc.contributor.author | Aggarwal, Gunjan | |
dc.contributor.committeeMember | Hoffman, Judy | |
dc.contributor.committeeMember | Parikh, Devi | |
dc.contributor.department | Computer Science | |
dc.date.accessioned | 2023-05-18T17:54:18Z | |
dc.date.available | 2023-05-18T17:54:18Z | |
dc.date.created | 2023-05 | |
dc.date.issued | 2023-05-01 | |
dc.date.submitted | May 2023 | |
dc.date.updated | 2023-05-18T17:54:18Z | |
dc.description.abstract | My thesis presents a scalable approach for learning open-world object-goal navigation (ObjectNav) – the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., “find a sink”). The approach is entirely zero-shot – i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., “sink,” “bathroom sink,” etc.) by projecting language goals into the same multimodal, semantic embedding space. As a result, our approach enables open-world ObjectNav. We extensively evaluate our agents on three ObjectNav datasets (Gibson, HM3D, and MP3D) and observe absolute improvements in success of 4.2% to 20.0% over existing zero-shot methods. For reference, these gains are similar to or better than the 5% improvement in success between the Habitat 2020 and 2021 ObjectNav challenge winners. In an open-world setting, we find that our agents generalize to compound instructions in which a room is explicitly mentioned (e.g., “Find a kitchen sink”) or must be inferred (e.g., “Find a sink and a stove”). | |
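Note: the abstract's central idea is that a goal embedding from a shared image-text space can stand in for either an image goal (at training time) or a language goal (at inference time). The following minimal sketch illustrates that substitution, assuming a CLIP-style encoder via the Hugging Face transformers library; the model checkpoint, file name, and text prompt are illustrative assumptions, not the thesis code.

```python
# Minimal sketch of the multimodal goal-embedding idea, assuming a CLIP-style
# encoder. Model name, image path, and prompt are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Training time (ImageNav recast as SemanticNav): embed the goal image.
goal_image = Image.open("goal_view.png")  # hypothetical goal snapshot
image_inputs = processor(images=goal_image, return_tensors="pt")
with torch.no_grad():
    image_goal_embedding = model.get_image_features(**image_inputs)  # (1, 512)

# Inference time (open-world ObjectNav): embed a free-form language goal
# into the same space and feed it to the agent in place of the image goal.
text_inputs = processor(text=["a kitchen sink"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_goal_embedding = model.get_text_features(**text_inputs)  # (1, 512)

# The navigation policy consumes a goal embedding regardless of its modality,
# so the same trained agent can follow either kind of instruction.
```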
dc.description.degree | M.S. | |
dc.format.mimetype | application/pdf | |
dc.identifier.uri | https://hdl.handle.net/1853/72039 | |
dc.language.iso | en_US | |
dc.publisher | Georgia Institute of Technology | |
dc.subject | Embodied AI | |
dc.subject | multi-modal | |
dc.subject | navigation | |
dc.subject | zero-shot | |
dc.title | Zero-shot object-goal navigation using multimodal goal embeddings | |
dc.type | Text | |
dc.type.genre | Thesis | |
dspace.entity.type | Publication | |
local.contributor.advisor | Parikh, Devi | |
local.contributor.corporatename | College of Computing | |
local.contributor.corporatename | School of Computer Science | |
relation.isAdvisorOfPublication | 2b8bc15b-448f-472b-8992-ca9862368cad | |
relation.isOrgUnitOfPublication | c8892b3c-8db6-4b7b-a33a-1b67f7db2021 | |
relation.isOrgUnitOfPublication | 6b42174a-e0e1-40e3-a581-47bed0470a1e | |
thesis.degree.level | Masters |