On the Resource Efficiency of Language Models

Author(s)
Zhang, Rongzhi
Associated Organization(s)
School of Computational Science and Engineering
Abstract
Large language models (LLMs) have revolutionized a wide range of natural language processing tasks. However, their practical utility faces resource challenges along two dimensions: data efficiency and model efficiency. During post-training, LLMs face data curation challenges, because high-quality labeled data is scarce and expensive to obtain, and data utilization challenges, because existing methods fail to extract maximum performance from the limited data that is available. During deployment, LLMs face parameter efficiency challenges stemming from their enormous size and inference efficiency constraints stemming from memory-intensive operations. This thesis addresses these resource bottlenecks through two complementary thrusts.

Thrust 1: How to enhance data efficiency in the post-training stage? Adapting LLMs to specific tasks and human values remains data-intensive. Two obstacles dominate: first, the scarcity of high-quality labeled data creates bottlenecks for domain adaptation; second, existing methods fail to extract maximum value from limited training examples. I introduce PRBoost, an interactive weakly supervised framework that discovers high-quality labeling rules, and DORM, which optimizes preference data weights to reduce the data required for model alignment by 10 to 40 times.

Thrust 2: How to improve model efficiency in the deployment stage? LLMs' parameter counts and memory demands restrict deployment across diverse computing environments. I tackle this through two approaches: improving knowledge distillation to obtain more efficient models, and optimizing inference-time memory usage. PTLoss, a perturbation-based distillation framework, improves student model generalization by constructing more accurate proxy teacher distributions. LoRC progressively compresses the KV cache, substantially reducing GPU memory requirements while preserving performance.

Together, these thrusts form a framework for efficient, deployable language models. The findings demonstrate that targeted resource optimization enables broader LLM deployment across diverse applications and computational environments.
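The abstract only names LoRC; the algorithmic details are in the thesis itself. Purely as a hedged illustration of the general idea behind low-rank KV-cache compression, and not the author's method, the sketch below factorizes a key-projection matrix with a truncated SVD so cached keys can be stored in a lower-dimensional subspace. The matrix sizes, the rank of 64, and all variable names are illustrative assumptions.

import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    # Truncated SVD: weight (d_out x d_in) is approximated as A @ B,
    # so per-token activations can be cached as rank-dimensional codes.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank)
    B = Vh[:rank, :]             # (rank, d_in)
    return A, B

# Toy example with hypothetical sizes: compress a key projection to rank 64.
d_model = 1024
W_k = torch.randn(d_model, d_model)
A, B = low_rank_factorize(W_k, rank=64)

x = torch.randn(8, d_model)               # hidden states for 8 tokens
k_full = x @ W_k.T                        # full-rank keys, shape (8, d_model)
k_cache = x @ B.T                         # cache only the rank-64 codes, shape (8, 64)
k_restored = k_cache @ A.T                # reconstruct keys on demand, shape (8, d_model)
print((k_full - k_restored).norm() / k_full.norm())  # relative approximation error

In this toy setting the per-token key cache shrinks from 1024 to 64 floats; the progressive, layer-wise compression strategy and accuracy trade-offs described in the thesis are not represented here.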
Date
2025-04-29
Resource Type
Text
Resource Subtype
Dissertation