Representation Learning for Grounding Vision and Language in Hierarchical Robot Planning
Author(s)
Xu, Ruinian
Abstract
The objective of this thesis is to enhance representation learning for grounding vision and language in hierarchical robot planning, with the aim of improving the autonomy and robustness of robot systems for indoor household activities. Autonomous robots interact with their local operational environments by following the standard Sense-Plan-Act paradigm. With hierarchical planning, we break the pipeline into sensing, task planning, motion planning, and acting. We employ hybrid systems that combine connectionist models with traditional planning algorithms, permitting the processing of highly variable data while producing reliable task and action sequences. Such hybrid designs leverage the strengths of both methods but require aligning the output representations of connectionist models with the input representations of classical planning algorithms. Our investigation focuses on designing connectionist models that provide the necessary specification information as directly as possible, which improves task-relevant performance metrics. We primarily explore representation learning for hierarchical task and motion planning.

For motion planning, assuming the task sequence has been established, our focus is on grounding perceptual inputs into intermediate representations that support motion planning. We start with simple robotic grasping and then extend to more general robotic manipulation. Input representations for motion planning algorithms are specified in the SE(3) frame. The research literature on grasp representation learning typically examines how to ground an RGB-D image into multiple oriented grasp bounding boxes. Assuming top-down grasp orientations, a 2D oriented grasp bounding box is converted into a 2.5D pose, which is then used for planning motion trajectories (a sketch of this conversion follows the abstract). However, learning bounding box representations via direct regression often yields an insufficient understanding of the spatial context of the image. This can degrade the efficiency of representation learning and the robustness of deep networks, and in turn compromise the performance and reliability of the entire execution.

Beyond simple grasping, general robotic manipulation requires grounding perceptual inputs into representations that capture action-level information. We employ the concept of affordance, which describes potential physical interactions between objects and directly associates perception with the actions needed for motion planning. In representation learning, affordance prediction is typically addressed through segmentation methods, which output pixel-wise affordance labels. However, affordance segmentation is often insufficient for computing a pose in the SE(3) frame, since it does not encode directional information for direction-sensitive affordances such as cutting. Additionally, the accuracy of the converted 3D pose can depend heavily on overall segmentation performance. To develop a more comprehensive affordance representation, studies have investigated category-level and task-level affordance keypoints. Keypoint representations augment affordance semantics with action geometry, allowing 2D representations to be converted into 3D poses. However, category-level and task-level keypoint representations often struggle to generalize to novel scenarios involving unseen categories and tasks. Addressing these limitations is crucial for advancing the capabilities of robotic systems in unstructured environments.
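To make the 2D-to-2.5D conversion above concrete, the following is a minimal Python sketch of how the center of an oriented grasp box detected in an RGB-D image might be lifted to a top-down grasp pose. The pinhole back-projection, the function names, and the parameter names are illustrative assumptions, not the thesis's actual implementation.

    import numpy as np

    def grasp_box_to_pose(u, v, theta, depth, fx, fy, cx, cy):
        """Back-project the center (u, v) of an oriented grasp box through
        a pinhole camera model to obtain a top-down 2.5D grasp pose.

        theta is the in-plane rotation of the box; depth is the measured
        depth at the center in meters. All names here are illustrative.
        """
        # Pinhole back-projection of the pixel center into the camera frame.
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        return np.array([x, y, depth]), theta

    def pose_to_se3(position, yaw):
        """Lift the 2.5D pose (position + yaw) to a homogeneous SE(3)
        transform whose approach axis is the camera z-axis (top-down)."""
        c, s = np.cos(yaw), np.sin(yaw)
        T = np.eye(4)
        T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
        T[:3, 3] = position
        return T

Under the top-down assumption, the full SE(3) pose is determined by the back-projected box center, the measured depth, and the in-plane box angle alone, which is exactly the 2.5D reduction referred to above.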
In terms of task planning, we concentrate on language-conditioned task planning, which can be decomposed into the sub-problems of natural language understanding and task planning. Early methods address this problem sequentially: they first perform semantic parsing to transform natural language into symbolic representations, and then use symbolic planning algorithms to produce action sequences. However, such methods rely heavily on the syntactic structure of the text and struggle to generalize to complex and ambiguous natural language. With the development of deep learning, deep neural networks, with their remarkable capability for representation learning, have been employed to address this issue. Learning-based methods for language-conditioned task planning are typically built in a fully connectionist manner. Despite their success in representation learning, these methods often face challenges in long-horizon planning, where robust and accurate task execution over extended sequences becomes critical. The research gap lies in creating a system that combines efficient representation learning for processing highly variable data with the robust and accurate planning needed to generate reliable action sequences. Balancing these aspects is crucial for advancing the field of language-conditioned task planning and improving the autonomy and effectiveness of robotic systems.
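As a rough illustration of the hybrid design discussed above, the sketch below stubs the connectionist grounding step with a keyword rule and pairs it with a classical breadth-first symbolic planner. The predicates, action schemas, and the ground_instruction stub are hypothetical examples for exposition, not components of the thesis.

    from collections import deque

    # name: (preconditions, add effects, delete effects); all illustrative.
    ACTIONS = {
        "pick(cup)": ({"hand_empty"}, {"holding(cup)"}, {"hand_empty"}),
        "place(cup,table)": ({"holding(cup)"},
                             {"on(cup,table)", "hand_empty"},
                             {"holding(cup)"}),
    }

    def ground_instruction(text):
        """Stand-in for a learned grounding model that maps language to a
        symbolic goal; a real system would use a connectionist model."""
        return {"on(cup,table)"} if "table" in text else set()

    def plan(init, goal):
        """Breadth-first search over symbolic states until the goal holds."""
        frontier = deque([(frozenset(init), [])])
        visited = {frozenset(init)}
        while frontier:
            state, seq = frontier.popleft()
            if goal <= state:
                return seq
            for name, (pre, add_eff, del_eff) in ACTIONS.items():
                if pre <= state:
                    nxt = frozenset((state - del_eff) | add_eff)
                    if nxt not in visited:
                        visited.add(nxt)
                        frontier.append((nxt, seq + [name]))
        return None

    print(plan({"hand_empty"}, ground_instruction("put the cup on the table")))
    # -> ['pick(cup)', 'place(cup,table)']

The division of labor mirrors the hybrid designs described above: the learned component absorbs the variability of natural language, while the symbolic search provides the reliability needed for long-horizon execution.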
Date
2025-04-04
Resource Type
Text
Resource Subtype
Dissertation