Visual shape and pose recovery for robotic manipulation

Author(s)
Lin, Yunzhi
Abstract
The objective of this thesis is to extract visual shape and pose information for robotic manipulation tasks, enhancing a robot's ability to operate in less-structured scenes. Manipulation is a multi-step task consisting of sequential actions applied to an object: perception, path planning, and closure of the gripper, followed by a task-relevant motion with the grasped object. Perception, which serves as the primary input source, has been widely studied.

Recent research has primarily focused on grasping objects effectively, with the goal of enabling robotic systems to grasp a diverse range of objects with high accuracy. While humans grasp objects effortlessly from a young age, precise robot grasping remains a formidable challenge due to the vast diversity of objects a robotic arm may encounter and the intricate contact dynamics associated with specific robot hand designs. Deep learning has emerged as a promising approach to these challenges by detecting SE(2) × R² grasp representations for parallel plate grippers. However, existing methods still suffer from two related issues: sparse grasp annotations and insufficiently rich data, both of which lead to covariate shift. Another limitation lies in the grasp configurations themselves. Most algorithms are designed for the bin-picking problem, which is task-agnostic and only performs top-down grasps; specific scenarios, however, may impose constraints that call for solutions permitting a variety of task-relevant grasp configurations. Notably, shape information offers an alternative route to the aforementioned problems, as explicit shape representations can inform fine-grained grasping operations.

Beyond how to grasp, how to manipulate is a more demanding task. It requires the robot to be aware of target-centric information, including the locations and poses of objects, also known as the 6-DoF pose estimation problem (i.e., 6 degrees of freedom: 3D position and 3D orientation). Accurate, real-time pose information for nearby objects in the scene would allow robots to engage in semantic interaction. Pose estimation is a rich topic in the computer vision community, yet most existing methods focus on instance-level object pose estimation and therefore scale poorly. Category-level object pose estimation, by contrast, handles all instances within a specific category and promises to scale better for real-world applications. More recently, generalized object pose estimation has attracted attention because it removes the assumption of known instances or categories, making it more broadly applicable than the methods above.

In this thesis, we focus on improving robotic manipulation through primitive shape recognition, category-level pose estimation and tracking, generalized pose estimation and tracking, and multi-level robotic scene understanding. A series of methods is proposed to improve the applicability of robotic manipulation in the real world. To alleviate the data-insufficiency problem and generate multiple 3D grasp configurations, we propose a segmentation-based architecture that decomposes objects sensed with a depth camera into multiple primitive shapes, along with a post-processing pipeline for robotic grasping. Segmentation employs a deep network trained on synthetic data with six classes of primitive shapes generated in a simulation engine. Each primitive shape is designed with parametrized grasp families, permitting the pipeline to identify multiple grasp candidates per shape region. The grasps are rank-ordered, and the first feasible one is chosen for execution.
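The following is a minimal illustrative sketch, in Python, of the rank-and-select step described above; it is not the thesis implementation. The segmentation network is replaced by a precomputed label map, and the grasp parameterization, scoring heuristic, class names, and feasibility check are toy assumptions for illustration only.

```python
# Minimal illustrative sketch (not the thesis implementation) of the
# "segment into primitive shapes, generate grasp families, rank, and pick
# the first feasible grasp" idea. The segmentation network is replaced by a
# precomputed label map; the grasp parameterization, scoring heuristic, and
# feasibility check are toy assumptions.
from dataclasses import dataclass
import numpy as np

# Hypothetical label set standing in for the six primitive-shape classes.
PRIMITIVE_CLASSES = ("cylinder", "cuboid", "sphere", "semi_sphere", "ring", "stick")

@dataclass
class Grasp:
    position: np.ndarray   # (3,) grasp center in the camera frame
    approach: np.ndarray   # (3,) unit approach direction
    width: float           # required gripper opening
    score: float = 0.0     # quality score used for ranking

def grasp_family(shape_class: str, centroid: np.ndarray, scale: float, n: int = 8):
    """Generate a parameterized family of grasp candidates for one region.

    In the real pipeline the primitive class would select a dedicated grasp
    family; here we only sweep approach directions around the centroid.
    """
    grasps = []
    for k in range(n):
        theta = 2.0 * np.pi * k / n
        approach = np.array([np.cos(theta), np.sin(theta), -0.5])
        approach /= np.linalg.norm(approach)
        grasps.append(Grasp(position=centroid, approach=approach, width=scale))
    return grasps

def rank_and_select(labels: np.ndarray, points: np.ndarray,
                    is_feasible=lambda g: g.width < 0.08):
    """Collect candidates over all primitive regions, rank them, and return
    the first feasible one (mirroring the rank-then-execute step)."""
    candidates = []
    for label in np.unique(labels):
        if label < 0:                       # -1 marks background pixels
            continue
        region = points[labels == label]    # (M, 3) points of this region
        centroid = region.mean(axis=0)
        scale = float(np.ptp(region, axis=0).max())   # crude size estimate
        shape_class = PRIMITIVE_CLASSES[int(label) % len(PRIMITIVE_CLASSES)]
        for g in grasp_family(shape_class, centroid, scale):
            # Toy score: prefer more top-down approaches on smaller regions.
            g.score = -g.approach[2] / (1.0 + scale)
            candidates.append(g)
    for g in sorted(candidates, key=lambda g: g.score, reverse=True):
        if is_feasible(g):
            return g
    return None

if __name__ == "__main__":
    # Synthetic stand-in for network output: two labeled regions, 4x4 pixels.
    seg = np.array([[0, 0, 1, 1]] * 4)
    pts = np.random.default_rng(0).uniform(0.0, 0.05, size=(4, 4, 3))
    best = rank_and_select(seg.reshape(-1), pts.reshape(-1, 3))
    print("selected grasp:", best)
```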
For category-level object pose estimation, we propose a simple and efficient RGB-based approach (no depth) that requires only oriented 3D bounding-box annotations at training time and therefore no CAD models. This design decision allows us to take advantage of large collections of real-world images. We also extend the design to the tracking problem, incorporating uncertainty estimation through a tracklet-conditioned deep network and a filtering process. To further expand the scalability of object pose estimation, we investigate the inverse use of parallel NeRFs for robust object pose estimation in a render-and-compare manner; the approach applies to any novel object, removing the assumption of known categories. We also explore generalized object pose tracking in dynamic environments, developing a streamlined pipeline that combines video segmentation, uncertainty-aware keypoint refinement, and structure from motion to track 6-DoF poses from short-term monocular RGB video, together with a large-scale, photo-realistic synthetic dataset for training and evaluation. Finally, we establish a comprehensive scene representation for advanced manipulation, which includes high-fidelity 3D reconstruction, a rough approximation of primitive shapes, and accurate object pose estimation. Throughout this research, we study how to extract visual shape and pose information for real-world manipulation scenarios.

The rest of the thesis is organized as follows. Chapter 1 gives a high-level overview of the related fields. Chapter 2 focuses on shape recognition for object grasping. Chapter 3 introduces a keypoint-based, RGB-only, category-level pose estimator. Chapter 4 extends it into a tracker via a tracklet-conditioned network and a filtering process. Chapter 5 explores the inverse use of NeRF in parallel for pose estimation. Chapter 6 presents the large-scale dataset and uncertainty-aware keypoint estimation for object pose tracking. Finally, Chapter 7 introduces a multi-level scene representation built from multi-view RGB inputs for robotic manipulation.
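As a concrete illustration of the geometric step behind a keypoint-based, RGB-only pose estimator (Chapter 3), the sketch below recovers a 6-DoF pose from 2D projections of the eight oriented 3D bounding-box corners via PnP. The thesis method predicts such keypoints with a deep network and need not use PnP; the helper names, camera intrinsics, box dimensions, and simulated pose here are illustrative assumptions.

```python
# Minimal illustrative sketch of the geometric step behind a keypoint-based,
# RGB-only pose estimator: recover a 6-DoF pose from 2D projections of the
# eight oriented 3D bounding-box corners via PnP. The keypoint-predicting
# network is omitted, and the thesis method need not use PnP; intrinsics,
# box dimensions, and the simulated pose below are illustrative assumptions.
import numpy as np
import cv2

def box_corners(dims: np.ndarray) -> np.ndarray:
    """Eight corners of a box with extents `dims`, centered at the origin."""
    dx, dy, dz = dims / 2.0
    return np.array([[sx * dx, sy * dy, sz * dz]
                     for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                    dtype=np.float64)

def pose_from_keypoints(corners_3d, keypoints_2d, K):
    """Recover rotation (3x3) and translation (3,) from 2D-3D correspondences."""
    ok, rvec, tvec = cv2.solvePnP(corners_3d, keypoints_2d, K, None,
                                  flags=cv2.SOLVEPNP_EPNP)
    assert ok, "PnP failed"
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec.reshape(3)

if __name__ == "__main__":
    K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
    dims = np.array([0.10, 0.20, 0.15])          # box extents in meters
    corners = box_corners(dims)

    # Simulate "predicted" keypoints by projecting with a known pose.
    R_gt, _ = cv2.Rodrigues(np.array([[0.2], [-0.1], [0.3]]))
    t_gt = np.array([0.05, -0.02, 0.80])
    cam_pts = corners @ R_gt.T + t_gt            # transform corners to camera frame
    uv = cam_pts @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                  # perspective division

    R, t = pose_from_keypoints(corners, uv, K)
    print("translation error:", np.linalg.norm(t - t_gt))
```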
Date
2024-08-01
Resource Type
Text
Resource Subtype
Dissertation