Robust, Efficient, and Adaptable Multimodal Artificial Intelligence for Vertical Applications
Author(s)
Verma, Gaurav
Abstract
Large artificial intelligence (AI) models have drawn widespread attention for their impressive — sometimes “superhuman” — performance on standardized benchmarks. Yet, a decade of research applying these models in domains such as well-being, web safety, and education has revealed persistent challenges. Many popular models are brittle to small input variations, large language models (LLMs) are sensitive to prompt formatting, and their performance degrades in highly specialized settings. These shortcomings are often amplified when AI systems struggle to effectively support diverse user groups, such as users with a lower Need for Cognition.
To systematically address these challenges, this thesis introduces a framework for transforming foundational large AI models into real-world solutions by strengthening their vertical-agnostic properties and overcoming challenges in vertical-specific applications. Large AI models that process, understand, and generate multimodal data — spanning vision and language — form the backbone of human-like AI interaction and enable richer world understanding than unimodal systems. Toward this goal, this thesis advances the use of large multimodal models across multiple verticals by addressing three vertical-agnostic properties: (a) robustness to realistic data variations, (b) efficient cross-modal mapping, and (c) adaptability to new tasks and domains. First, we assess how current multimodal models respond to cross-modally grounded input variations and expose their brittleness. Next, we propose a simple yet effective method to capture text “visualness,” thereby improving the efficiency of text-to-image retrieval and generation. Finally, we demonstrate how to adapt multimodal agents to custom workflows with minimal human demonstrations. Addressing these issues of robustness, efficiency, and adaptability lowers the barrier to integrating multimodal AI across verticals and enables more effective remediation techniques.
Building on these foundational elements, the thesis transitions from vertical-agnostic properties to vertical-specific applications. We show that delivering value in specific verticals requires tailored data, models, and evaluation. Focusing on web safety and well-being, we collaborate with domain experts to (a) characterize and detect violence-provoking speech and (b) use LLMs to uncover personal well-being insights that can inform policy making. These focused efforts yield a nuanced view of current AI strengths and limitations. Concluding these explorations, we show how vertical-specific insights can feed back into model improvement, highlighting inequitable outcomes across languages and demonstrating that multimodal training can help mitigate these disparities.
Date
2025-04-23
Resource Type
Text
Resource Subtype
Dissertation