Multimodal Human Behavior Modeling: From Understanding to Generation
Author(s)
Lai, Bolin
Abstract
Humans learn about and change the world through a variety of behaviors in daily activities. Human behavior modeling is therefore a critical step toward developing AI agents that can assist us in various tasks. In contrast to objects, scenes and textures, human behaviors are inherently purposeful, guided by underlying intentions and goals. Human behaviors also involve precise and adaptive interactions with the environment, characterized by fine-grained and nuanced control. These two key differences call for innovative approaches that allow AI models to understand the intentions behind our behaviors and capture the nuances of our actions across different tasks.
In this dissertation, I present my research on leveraging multimodal inputs to capture underlying intentions and enable precise control over human actions in both understanding and generation problems. First, I develop the first audio-visual egocentric gaze anticipation model, which forecasts gaze behaviors by fusing audio-visual streams in the temporal and spatial dimensions separately. Second, I collect a multimodal social interaction dataset with detailed annotations and analyze the contribution of visual signals to social scenario understanding. Third, I introduce a novel egocentric action frame generation task for efficient skill learning, along with an innovative method that enhances action generation performance by bridging the gap between large language models and diffusion models in the feature space. Finally, I propose a unified text-image-to-video (TI2V) generation problem that includes all existing TI2V settings, and introduce a novel training-free method to condition pre-trained text-to-video foundation models on any number of given images. In conclusion, the ultimate goal of my research is to enable AI models to better understand and interact with people, paving the way toward human-centric artificial intelligence.
Date
2026-05
Resource Type
Text
Resource Subtype
Dissertation (PhD)