Extending Multimodal Large Language Models Beyond Single Modalities

Author(s)
Vasu, Manushree
Advisor(s)
Shi, Humphrey
Associated Organization(s)
School of Computer Science
Abstract
Multimodal large language models (MLLMs) aim to unify diverse sensory inputs, such as text, vision, and audio, within a single reasoning framework, enabling more comprehensive and human-like understanding. However, as the number of supported modalities increases, generalist MLLMs often underperform specialist models due to cross-modal interference and imbalanced data quality across modalities. This thesis systematically investigates these challenges and proposes solutions through a series of controlled experiments. The proposed architecture integrates CLIP and BEATs encoders for vision and audio, respectively, with a Vicuna-v1.5 large language model, connected via lightweight two-layer MLP projectors. A two-stage training protocol (modality alignment pretraining followed by supervised instruction tuning) is employed, using carefully curated datasets for each modality. Experimental results demonstrate that true joint training, especially when paired with high-quality, targeted datasets, significantly mitigates cross-modal interference and enhances performance on both vision and audio tasks. Notably, simple projection layers outperform more complex alternatives such as the Q-Former in this setup, and instruction-following behaviors learned in well-resourced modalities readily transfer to data-scarce ones under joint training. The findings highlight the importance of training strategy, data quality, and architectural simplicity in building scalable, generalist MLLMs. Future work will extend these strategies to additional modalities and explore adaptive curriculum learning for continual multimodal integration.
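
To make the connector design concrete, below is a minimal PyTorch sketch of the two-layer MLP projector described in the abstract. It is illustrative only: the feature dimensions (1024 for CLIP ViT-L/14 patch tokens, 768 for BEATs frame features, 4096 for the Vicuna-7B hidden size) are assumptions made for this sketch, not values taken from the thesis.

    import torch
    import torch.nn as nn

    class ModalityProjector(nn.Module):
        """Two-layer MLP that maps encoder features into the LLM embedding space."""
        def __init__(self, encoder_dim: int, llm_dim: int = 4096):  # 4096 assumed (Vicuna-7B)
            super().__init__()
            # Linear -> GELU -> Linear, the common LLaVA-style projector design.
            self.proj = nn.Sequential(
                nn.Linear(encoder_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            # features: (batch, num_tokens, encoder_dim) -> (batch, num_tokens, llm_dim)
            return self.proj(features)

    # One projector per modality; projected tokens are concatenated with text
    # embeddings before being fed to the language model.
    vision_projector = ModalityProjector(encoder_dim=1024)  # CLIP feature dim (assumed)
    audio_projector = ModalityProjector(encoder_dim=768)    # BEATs feature dim (assumed)
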
Date
2025-04-30
Resource Type
Text
Resource Subtype
Thesis