Learning Vision and Language Cues for Video Understanding in Egocentric and Instructional Videos
Author(s)
Beedu, Apoorva
Abstract
We perceive the world through a combination of senses, such as sound, smell, and vision,
to learn from and interact with our surroundings. Among these, vision and hearing are
the primary sources of information gathering, especially through reading and listening.
Effectively utilizing and combining these senses is key to developing intelligent systems that
can operate in and understand complex environments. A critical challenge hindering effective
vision-language learning is the lack of understanding of why and how to integrate
language for improved video understanding.
In this dissertation, we leverage the language modality to learn effective video
representations across a range of tasks, including action recognition, forecasting, and
summarization. The key ideas developed in this thesis are (i) Vision-Language supervision for
action understanding, and (ii) Leveraging language for video summarization.
In Vision-Language supervision for action understanding, we generate rich action
descriptions and leverage information from multiple modalities to recognize and anticipate
future actions in videos. We also discover the extent to which language contributes to
understanding actions in videos, through effective cross-modal supervision between the vision
and language modalities.
Finally, in Leveraging language for video summarization, we generate text outputs for
every input modality and evaluate the performance of foundational models on the video
summarization task. By using text as the primary mode of input, we evaluate how text
representations perform on video summarization. Building on this, we propose a hierarchical
framework that incorporates multi-granular language cues and evaluate its effectiveness
for video summarization.
Date
2025-08-25
Resource Type
Text
Resource Subtype
Dissertation (PhD)