On the Efficiency and Steerability of Self-Attention Mechanism of Large Language Models

Author(s)
Zhang, Qingru
Advisor(s)
Editor(s)
Associated Organization(s)
Organizational Unit
School of Computational Science and Engineering
School established in May 2010
Series
Supplementary to:
Abstract
Large language models (LLMs) have revolutionized natural language processing through their powerful self-attention mechanisms. In this thesis, we focus on two crucial aspects of self-attention, efficiency and steerability, and explore innovative prompting techniques to improve contextual understanding and complex reasoning.

Mixed attention span for efficient long-sequence modeling. The full attention mechanism in transformers incurs quadratic computational cost, severely limiting scalability for long sequences. Sparse attention mechanisms reduce this complexity but struggle to capture intricate dependencies. In chapter 2, we propose MASFormer, an easy-to-implement transformer variant that equips a few top layers with full attention and the remaining layers with sparse attention. MASFormer effectively captures both short-range and sparse long-range dependencies, significantly enhancing computational efficiency without compromising performance.

Near-lossless KV cache compression for efficient generative inference. Generative inference in LLMs depends heavily on Key-Value (KV) caches, which become memory-intensive as sequence length grows, leading to memory-bound inference and significantly reduced throughput. Existing compression methods, such as token dropping and quantization, introduce large approximation errors, resulting in substantial accuracy degradation. In chapter 3, we introduce GEAR, an efficient framework that augments ultra-low-precision quantization with two error-reduction techniques. GEAR achieves near-lossless performance while significantly reducing memory consumption and improving inference throughput.

Post-hoc attention steering to guide model attention. In human-written articles, we frequently rely on textual emphasis to direct readers' attention. However, existing LLMs typically process plain text without explicit mechanisms for attention guidance, limiting users' ability to steer model focus effectively. In chapter 4, we develop PASTA, a method that allows users to guide LLM attention through post-hoc attention steering. Without requiring retraining, PASTA significantly improves models' adherence to user instructions and integration of specified contextual information, enhancing model controllability and performance.

Steerable prompting to improve reading comprehension. LLMs often struggle to accurately comprehend extensive or complex contexts, leading to erroneous or hallucinated responses, which is especially problematic in open-book question-answering (QA) tasks. In chapter 5, we propose SteerPrompt, an inference-time prompting method that automatically identifies important contexts and explicitly highlights them through attention steering. SteerPrompt significantly improves model reading comprehension and accuracy on open-book QA tasks.

Symphony of thoughts prompting to improve mathematical reasoning. Existing reasoning methods for LLMs, such as chain-of-thought prompting, solve subproblems sequentially, which can propagate errors and undermine accuracy, particularly in intricate mathematical modeling and optimization (MMO) problems. In chapter 6, we introduce Symphony of Thoughts (SoT), a parallel reasoning strategy that decomposes problems into simultaneously solvable subproblems, minimizing error propagation. Validated on the newly proposed GEMMO benchmark, SoT achieves greater robustness and accuracy on complex mathematical reasoning tasks.
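The mixed-attention idea behind MASFormer can be illustrated with a toy sketch: the top layers receive a full attention mask, while the remaining layers receive a sliding-window sparse mask. The `window` parameter and the mask construction below are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Sparse (local) mask: token i may attend only to tokens within w positions."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def attention(q, k, v, mask):
    """Masked scaled dot-product attention (single head, no batching)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)       # disallow masked positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def masformer_layer_masks(n_layers, n_full_top, seq_len, window):
    """Full attention only in the top `n_full_top` layers; sparse elsewhere."""
    full = np.ones((seq_len, seq_len), dtype=bool)
    local = sliding_window_mask(seq_len, window)
    return [full if layer >= n_layers - n_full_top else local
            for layer in range(n_layers)]
```

The point of the design is that long-range dependencies are sparse, so a handful of full-attention layers at the top suffices to capture them, while the cheap local layers below handle short-range structure.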
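The error-reduction principle behind GEAR-style KV cache compression can be sketched generically: quantize aggressively, then store a compact correction for the residual error. The rank-`rank` SVD correction below is one plausible error-reduction technique chosen for illustration; the two techniques GEAR actually uses are described in chapter 3.

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Per-tensor uniform quantization to 2**bits levels."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((x - lo) / scale)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

def compress_with_lowrank_residual(x, bits=4, rank=2):
    """Quantize x, then keep a low-rank approximation of the residual error."""
    q, scale, lo = quantize_uniform(x, bits)
    residual = x - dequantize(q, scale, lo)
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return q, scale, lo, low_rank

def reconstruct(q, scale, lo, low_rank):
    """Dequantize and add back the stored residual correction."""
    return dequantize(q, scale, lo) + low_rank
```

Because the correction is low-rank, it adds only a small memory overhead on top of the ultra-low-precision cache while recovering part of the quantization error.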
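Post-hoc attention steering of the PASTA kind operates on attention weights at inference time, without retraining. A minimal sketch, assuming one plausible steering rule (downscale non-emphasized positions by a factor `alpha`, then renormalize); the exact PASTA operator and its head selection are described in chapter 4.

```python
import numpy as np

def steer_attention(weights, emphasized, alpha=0.01):
    """Downscale attention to non-emphasized positions, then renormalize
    each row so it still sums to 1. `weights` is a (queries, keys) matrix
    of post-softmax attention; `emphasized` lists key positions the user
    wants the model to focus on."""
    steered = weights.copy()
    keep = np.zeros(weights.shape[-1], dtype=bool)
    keep[emphasized] = True
    steered[:, ~keep] *= alpha
    return steered / steered.sum(axis=-1, keepdims=True)
```

With a small `alpha`, nearly all attention mass shifts to the user-highlighted positions, which is the mechanism by which steering makes the model follow emphasized instructions or context.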
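The parallel-decomposition idea behind SoT can be caricatured in a few lines: once a problem is split into independent subproblems, each branch is solved in isolation, so an error in one branch cannot propagate into the others the way it can in a sequential chain. The `solve` callable below is a hypothetical stand-in for an LLM call on one subproblem.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_parallel(subproblems, solve):
    """Solve independent subproblems concurrently and collect the results.

    Unlike sequential chain-of-thought, no branch consumes another
    branch's (possibly wrong) intermediate answer."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(solve, subproblems))
```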
Sponsor
Date
2025-04-23
Extent
Resource Type
Text
Resource Subtype
Dissertation
Rights Statement
Rights URI