On Parameter Efficiency of Neural Language Models
Author(s)
Liang, Chen
Abstract
In recent years, pre-trained neural language models have achieved remarkable capabilities across various natural language understanding and generation tasks. However, scaling these models to billions of parameters, while enhancing adaptability and emergent capabilities, has introduced significant deployment challenges due to their massive size. These challenges include constraints on model storage and inference latency in real-world deployment, intensive time and computational costs for task adaptation, and substantial parameter redundancy that degrades task adaptability. Motivated by these challenges, this thesis aims to improve the parameter efficiency of these models, seeking to minimize storage requirements, accelerate inference and adaptation, and enhance generalizability.
\noindent {\it -- Improving Parameter Utilization in Neural Language Models} \\
While recent studies have identified significant redundancy in pre-trained neural language models, the impact of parameter redundancy on model generalizability remains largely underexplored. We first examine the relationship between parameter redundancy and model generalizability. Observing that removing redundant parameters improves generalizability, we propose an adaptive optimization algorithm for fine-tuning that improves the utilization of redundant parameters. Experimental results confirm improved generalization across various downstream tasks.
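The abstract does not specify the update rule of this adaptive optimizer. As a hedged illustration only, one generic way to raise the utilization of redundant parameters during fine-tuning is to scale each parameter's update by an importance score; the score $s_j$, coefficient $\gamma$, and learning rate $\eta$ below are illustrative assumptions rather than the thesis's definitions:
\[
\theta_j \;\leftarrow\; \theta_j \;-\; \eta\,\big(1 + \gamma\,(1 - s_j)\big)\,\nabla_{\theta_j}\mathcal{L}_{\text{task}}(\theta),
\qquad s_j \in [0,1],
\]
so that parameters judged redundant (small $s_j$) receive proportionally larger updates and are encouraged to encode task-relevant information.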
\noindent {\it -- Model Compression in Neural Language Models} \\
We explore model compression methods, including weight pruning and knowledge distillation, to reduce model storage and accelerate inference. We first develop a reliable iterative pruning method that accounts for uncertainties in training dynamics. We then turn to knowledge distillation, addressing the large teacher-student ``knowledge gap'' that often hampers the student's performance. To close this gap, we offer two solutions that produce task-specific students by selectively distilling task-relevant knowledge. In scenarios demanding student adaptability across diverse tasks, we instead reduce the knowledge gap by combining iterative pruning with distillation. Our approaches significantly surpass conventional distillation methods at similar compression ratios.
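For context, a minimal sketch of the standard distillation objective that such methods build on (the selective, task-relevant distillation itself is specific to the thesis and is not reproduced here); $z_t$ and $z_s$ denote teacher and student logits, $T$ a temperature, and $\alpha$ a mixing weight:
\[
\mathcal{L}_{\text{KD}} \;=\; (1-\alpha)\,\mathcal{L}_{\text{CE}}\big(y,\,\sigma(z_s)\big)
\;+\; \alpha\,T^{2}\,\mathrm{KL}\!\big(\sigma(z_t/T)\,\big\|\,\sigma(z_s/T)\big),
\]
where $\sigma$ is the softmax. A larger teacher-student capacity mismatch makes the second term harder to minimize, which is the ``knowledge gap'' referred to above.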
\noindent {\it -- Efficient Task Adaptation in Neural Language Models} \\
While fine-tuning is an essential adaptation method for attaining satisfactory performance on downstream tasks, it is both computation-intensive and time-consuming. To speed up task adaptation, we study the hypernetwork approach, which employs an auxiliary hypernetwork to swiftly generate task-specific weights from few-shot demonstration examples. We improve the weight generation scheme by exploiting the intrinsic structure of the weights as an inductive bias, enhancing the sample efficiency of hypernetwork training. Our method shows superior generalization on unseen tasks compared to existing hypernetwork methods.
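The abstract leaves the generation scheme unspecified; as a hedged sketch under assumed notation, the hypernetwork approach can be written as mapping a few-shot demonstration set $D_{\text{demo}}$ to task-specific weight updates, for instance in a low-rank form (the encoder $\mathrm{enc}$, hypernetwork $H_{\phi}$, and low-rank factorization below are illustrative assumptions, not the thesis's construction):
\[
(A,\, B) \;=\; H_{\phi}\big(\mathrm{enc}(D_{\text{demo}})\big),
\qquad
\theta_{\text{task}} \;=\; \theta_{0} \;+\; A B^{\top},
\]
so that, once $H_{\phi}$ is trained, adapting to a new task requires only a forward pass through the hypernetwork rather than gradient-based fine-tuning of the base weights $\theta_{0}$.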
Date
2024-01-04
Resource Type
Text
Resource Subtype
Dissertation