Foundation Models for Robotic Manipulation: Opportunities and Challenges

Author(s)
Li, Yunzhu
Abstract
Foundation models, such as GPT, have achieved remarkable progress in natural language and vision, demonstrating strong adaptability to new tasks and scenarios. Physical interaction, such as cooking, cleaning, or caregiving, remains a frontier where these models and robotic systems have yet to reach comparable levels of generalization. In this talk, I will discuss opportunities for incorporating foundation models into robotic pipelines to extend capabilities beyond those of traditional methods. The focus will be on two areas: (1) task specification and (2) task-level planning. The central idea is to translate the commonsense knowledge embedded in foundation models into structural priors that can be integrated into robot learning systems. This approach combines the strengths of different modules (for example, VLMs for task interpretation and constrained optimization for motion planning), achieving the best of both worlds. I will show how such integration enables robots to interpret free-form natural language instructions and perform a wide range of real-world manipulation tasks. I will conclude by discussing current limitations of foundation models, key challenges ahead (particularly in multi-modal sensing and world modeling), and potential avenues for progress.
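
For concreteness, the following is a minimal sketch (not from the talk itself) of the modular pipeline the abstract describes: a VLM-style module translates a free-form instruction into a structural prior, and a generic constrained optimizer plans a motion against it. The interpret_instruction lookup table is a hypothetical stand-in for a real VLM, and the workspace constraint and goal coordinates are invented for illustration; only numpy and scipy are assumed.

import numpy as np
from scipy.optimize import minimize

def interpret_instruction(instruction: str) -> np.ndarray:
    """Hypothetical VLM stand-in: map language to a goal-position prior."""
    goals = {
        "place the cup on the shelf": np.array([0.4, 0.1, 0.8]),
        "put the sponge in the sink": np.array([0.2, -0.3, 0.5]),
    }
    return goals[instruction.lower()]

def plan_motion(start: np.ndarray, goal: np.ndarray) -> np.ndarray:
    """Constrained optimization: find an end-effector waypoint that
    approaches the goal while staying inside a unit-radius workspace."""
    objective = lambda x: np.sum((x - goal) ** 2)           # reach the goal
    workspace = {"type": "ineq",                            # stay inside a
                 "fun": lambda x: 1.0 - np.linalg.norm(x)}  # unit-radius shell
    result = minimize(objective, x0=start, constraints=[workspace])
    return result.x

if __name__ == "__main__":
    start = np.zeros(3)
    goal = interpret_instruction("place the cup on the shelf")
    waypoint = plan_motion(start, goal)
    print("goal prior:", goal, "-> planned waypoint:", waypoint)

The point of the sketch is the division of labor: the language module only emits a structured goal, so the optimizer can enforce physical constraints the language model knows nothing about.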
Date
2025-10-15
Extent
68:39 (minutes:seconds)
Resource Type
Moving Image
Resource Subtype
Lecture
Rights Statement
Unless otherwise noted, all materials are protected under U.S. Copyright Law and all rights are reserved.