Knowledge-Augmented Vision-and-Language Assistant
Author(s)
Kuo, Chia-Wen
Abstract
This PhD thesis explores knowledge augmentation for vision-and-language (VL) models, a pivotal area of artificial intelligence concerned with how machines perceive and describe visual content. While the potential of these models is immense, their development is hampered by a notable shortfall of knowledge in both the visual and linguistic modalities. This deficiency stems from two sources: pre-trained vision models that often miss crucial visual details, and the composition of large foundation models pre-trained on single-modal data, which fail to capture the intricacies of cross-modal VL knowledge. Conventional remedies, such as scaling up model sizes and datasets, bring their own challenges in the VL domain, particularly the difficulty and risk of fine-tuning. This thesis therefore proposes a different approach: supplementing VL models with external knowledge sources rather than retraining them to internalize all of the missing information.

At the outset of this research, standard practice in VL research relied on a frozen pre-trained detector for image encoding. This thesis was the first to apply knowledge augmentation to VL tasks, using free-form text descriptions to supply the missing visual information. As the research progressed, its scope broadened to incorporate a richer array of knowledge sources. A significant milestone came with the advent of larger, more capable VL models, where the thesis integrates knowledge augmentation while addressing the challenges of acquiring and applying high-quality knowledge sources in these more advanced systems.

Taken together, this sequence of work bridges critical gaps in both visual and linguistic processing, markedly improving the models' ability to interpret and describe visual data with rich context and linguistic coherence. Extensive analysis and experimentation demonstrate that knowledge-augmented VL models produce descriptions that are not only more accurate but also richer and more detailed, with a noticeable reduction in errors and hallucinations.
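To make the augmentation idea concrete, the following is a minimal sketch (not the thesis's actual implementation) of how frozen detector region features might be fused with embeddings of retrieved free-form text descriptions before a cross-modal encoder. All module names, dimensions, and the fusion strategy are illustrative assumptions.

```python
# Illustrative sketch only: fuse frozen detector region features with
# external text-knowledge embeddings via a small transformer encoder.
import torch
import torch.nn as nn

class KnowledgeAugmentedEncoder(nn.Module):
    def __init__(self, region_dim=2048, text_dim=768, hidden_dim=768, num_layers=2):
        super().__init__()
        # Project frozen detector region features and external text-knowledge
        # embeddings into a shared hidden space.
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, region_feats, knowledge_feats):
        # region_feats:    (B, num_regions, region_dim) from a frozen detector
        # knowledge_feats: (B, num_sentences, text_dim) from an external text source
        tokens = torch.cat(
            [self.region_proj(region_feats), self.text_proj(knowledge_feats)], dim=1
        )
        # Fused sequence would feed a downstream VL head (e.g., a caption decoder).
        return self.fusion(tokens)

# Toy usage with random tensors standing in for real detector and text encoders.
model = KnowledgeAugmentedEncoder()
fused = model(torch.randn(2, 36, 2048), torch.randn(2, 5, 768))
print(fused.shape)  # torch.Size([2, 41, 768])
```

The design choice illustrated here, keeping the vision backbone frozen and injecting knowledge as extra tokens, mirrors the thesis's motivation of avoiding costly and risky fine-tuning of large pre-trained components.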
Date
2023-12-10
Resource Type
Text
Resource Subtype
Dissertation