Developing an End-to-End Approach for Training a Real-Time Multimodal Gesture Generation Model

Author(s)
Punjwani, Saif
Abstract
As we move toward an increasingly virtual landscape, embodied communication between conversational AIs and human agents becomes crucial, as those interactions will shape our future. The proposed system aims to enhance human-agent interaction by generating gestures that closely resemble human motion and align with the accompanying speech, with a target inference latency of under 0.03 seconds. The pipeline’s foundation is a dataset that captures a wide range of human gestures across varied contexts. The system employs keypoint detection to extract meaningful features from the pre-processed videos and transform them into gesture vectors. The model architecture consists of two main components: the Multimodal Gesture Encoder (MGE) and the Gesture Interpretation and Generation Models (GIGMs). The MGE interprets spoken language and predicts corresponding physical gestures, while the GIGMs synthesize and output full-body gestures that align with the subtleties of the spoken language. Real-time performance and computational efficiency are achieved through recurrent neural networks with activity sparsity and sparse back-propagation, together with an efficient gesture processing pipeline. Integrating additional input modalities such as text, video, and audio, and extending generation to facial expressions and body posture, further enhances the naturalness and expressiveness of the generated gestures.
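The two-stage architecture and the activity-sparse recurrence described above can be sketched in code. The following is a minimal, illustrative PyTorch sketch, not the thesis implementation: the module names MGE and GIGM follow the abstract, but every dimension, the magnitude-threshold sparsity mechanism, and the keypoint output format are assumptions made for the example.

```python
# Illustrative sketch only: all dimensions, thresholds, and layer choices
# are assumptions, not the architecture actually used in the thesis.
import torch
import torch.nn as nn

class ActivitySparseGRU(nn.Module):
    """GRU whose hidden state is gated by a magnitude threshold: units
    below the threshold emit zeros, approximating activity sparsity."""
    def __init__(self, input_size: int, hidden_size: int, threshold: float = 0.1):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_size) -> (batch, time, hidden_size)
        batch, steps, _ = x.shape
        h = x.new_zeros(batch, self.cell.hidden_size)
        outputs = []
        for t in range(steps):
            h = self.cell(x[:, t], h)
            # Zero out low-magnitude activations. The comparison is
            # non-differentiable, so gradients flow only through the
            # surviving units, sparsifying the backward pass as well.
            h = h * (h.abs() > self.threshold).float()
            outputs.append(h)
        return torch.stack(outputs, dim=1)

class MGE(nn.Module):
    """Multimodal Gesture Encoder: fuses frame-aligned audio and text
    features into a latent gesture representation (dims are assumed)."""
    def __init__(self, audio_dim=80, text_dim=768, latent_dim=256):
        super().__init__()
        self.proj = nn.Linear(audio_dim + text_dim, latent_dim)
        self.encoder = ActivitySparseGRU(latent_dim, latent_dim)

    def forward(self, audio, text):
        # audio: (B, T, audio_dim); text: (B, T, text_dim)
        z = self.proj(torch.cat([audio, text], dim=-1))
        return self.encoder(z)

class GIGM(nn.Module):
    """Gesture Interpretation and Generation Model: decodes the latent
    sequence into per-frame keypoint vectors (57 3D joints assumed)."""
    def __init__(self, latent_dim=256, num_keypoints=57):
        super().__init__()
        self.decoder = ActivitySparseGRU(latent_dim, latent_dim)
        self.head = nn.Linear(latent_dim, num_keypoints * 3)

    def forward(self, latents):
        return self.head(self.decoder(latents))  # (B, T, num_keypoints * 3)

if __name__ == "__main__":
    mge, gigm = MGE(), GIGM()
    audio = torch.randn(1, 30, 80)   # 30 frames of mel-style features
    text = torch.randn(1, 30, 768)   # 30 frame-aligned text embeddings
    gestures = gigm(mge(audio, text))
    print(gestures.shape)            # torch.Size([1, 30, 171])
```

Because the mask is built from a non-differentiable comparison, gradients propagate only through the above-threshold units; that is one simple way to realize the sparse back-propagation the abstract alludes to, though the thesis may use a different mechanism.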
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis