Towards Efficiently and Reliably Harnessing Pre-trained Language Models: A Data-centric Lens

Author(s)
Yu, Yue
Associated Organization(s)
School of Computational Science and Engineering
Abstract
Large Language Models (LLMs) excel across diverse tasks, but their performance depends heavily on high-quality data, which is often scarce and costly to obtain. In this talk, I present three lines of research on data-centric approaches for enhancing LLM performance: (1) Cost-Efficient Data Collection: I discuss data-selection techniques that enable resource-efficient fine-tuning of language models, boosting performance at minimal data cost. (2) LLM-Assisted Synthetic Data Generation: I design synthetic data generation approaches that use LLMs to create diverse and unbiased datasets for classical NLP tasks, as well as to generate synthetic critiques that improve reward modeling for LLM alignment. (3) Instruction Fine-Tuning for Trustworthy LLMs: I present RankRAG, a data-efficient instruction fine-tuning pipeline that trains a single LM to handle both context ranking and answer generation, improving the efficacy of retrieval-augmented language models. Together, these thrusts form a comprehensive approach to creating and utilizing high-quality data, holding significant promise for applying pre-trained language models across diverse domains such as news, social media, biomedicine, and healthcare. The primary goal of my work is to develop data-centric AI solutions that emphasize efficiency, reliability, and practical impact in the context of large-scale models.
Date
2024-12-09
Resource Type
Text
Resource Subtype
Dissertation