Enhancing the Controllability of Visual Navigation Agents with Language-Conditioned Preferences

Author(s)
Putta, Pranav
Abstract
While recent advances in visual navigation have yielded impressive and robust goal-driven agents, incorporating user-specified constraints and preferences remains an important challenge. ImageNav agents are powerful models for studying decision-making in large neural networks. Given an image goal, they are trained end-to-end via reinforcement learning to effectively utilize state history, exploit world knowledge about semantic regularities in environments, and implicitly construct sub-goals to reach their destination. This thesis proposes a methodology to enhance the controllability of ImageNav models via language conditioning, leveraging the common-sense reasoning skills, generalization, and intuitive interface of large language models (LLMs). The proposed method first distills the policy of a pre-trained ImageNav agent into a vision-language-action (VLA) model. We then explore a method for generating a synthetic preference dataset by exploiting the human preference knowledge embedded in LLMs such as ChatGPT. The VLA model is then fine-tuned on this preference dataset, enabling language-specified user preferences to bias the exploration behavior of the agent as desired. Finally, we perform ablation experiments on different methods for iteratively improving the policy's conditioning on user preferences, primarily filtered behavior cloning and direct preference optimization, and study their generalization to novel, unseen preferences.
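The direct preference optimization step mentioned in the abstract can be illustrated per preference pair as follows. This is a minimal sketch of the standard DPO objective, not the thesis's implementation; the scalar interface and function name are illustrative, and the inputs are summed log-probabilities of a chosen and a rejected trajectory under the current policy and a frozen reference policy.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    Each argument is the summed log-probability of an action sequence
    under either the policy being trained or the frozen reference model.
    beta controls how strongly the policy may deviate from the reference.
    """
    # Implicit reward: log-ratio of policy to reference probability
    chosen = policy_chosen_logp - ref_chosen_logp
    rejected = policy_rejected_logp - ref_rejected_logp
    # Logistic loss pushes the preferred trajectory's implicit reward
    # above the rejected trajectory's
    margin = beta * (chosen - rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns no extra probability to either trajectory relative to the reference, the loss is ln 2; it falls as the preferred trajectory gains relative probability.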
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis