Learning to See While Learning to Act: Diffusion Models for Active Perception in Robot Imitation

Author(s)
Wang, Kuancheng
Abstract
Robots struggle to manipulate objects they cannot fully observe, much as a child struggles to find a toy hidden in a box. This paper introduces C3DM-Multiview, a novel approach that teaches robots to actively look around their environment while learning to complete tasks via diffusion models. Current methods for robot manipulation often rely on observations from a fixed camera that cannot capture occluded objects. C3DM-Multiview instead enables robots to dynamically adjust their viewpoints. Integrating camera movement with action refinement in a single diffusion model maintains a consistent relationship between what the robot sees and how it acts. Our camera-consistent noising strategy ensures that every noisy action, expressed in camera coordinates, naturally emphasizes the dimensions most relevant to the current observation. C3DM-Multiview significantly outperforms existing methods on occlusion-specific tasks built on the Ravens benchmark, exceeding previous approaches by up to 72% in challenging scenarios where objects are initially occluded from view. When evaluated on RLBench, our approach achieves a state-of-the-art 79% average success rate across six diverse manipulation tasks, surpassing methods that rely on full point clouds built from observations from multiple fixed cameras. These results demonstrate that learning where to look while learning how to act leads to more robust robot manipulation, particularly in complex real-world environments.
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis