Energy-Efficient Hardware Acceleration of Transformer-Based Models
Author(s)
Byun, Woohong
Abstract
This research presents a software-hardware co-optimization framework for energy-efficient deployment of transformer models on FPGAs. It introduces novel quantization techniques tailored to both BERT and generative LLMs. For weight quantization, a Hessian-based parameter-wise method assigns optimal bit precision based on sensitivity analysis, while a row-wise quantization scheme enhances hardware efficiency by converting mixed-precision matrices into two uniform-precision blocks. For attention activations, a Weight-Hessian-aware KV cache quantization applies intra-layer mixed precision using precomputed sensitivities, eliminating runtime overhead. To further improve hardware efficiency, a Query-Key coupled scheme aligns bit precision within each outer-product pair, reducing implementation complexity. A concurrent quantization approach jointly optimizes row-wise weight and Query-Key activation precision using multi-precision formats, improving both compression and energy efficiency. These techniques are implemented on a novel multi-precision FPGA accelerator for BERT and GPT-2 that supports both power-of-two and non-power-of-two bit-widths. With an optimized dataflow, the design minimizes off-chip memory access and significantly outperforms prior solutions in both energy efficiency and inference performance.
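The abstract describes the quantization pipeline only at a high level. The sketch below is a minimal, hypothetical illustration of the two ideas mentioned for weights: using a Hessian-based sensitivity measure to assign bit precision, and regrouping the resulting mixed-precision matrix into two uniform-precision blocks by row. All function names, the diagonal-Hessian sensitivity proxy (diag(H) * w^2), the 4-/8-bit choices, and the 25% high-precision fraction are assumptions made for illustration; they are not taken from the dissertation itself.

```python
import numpy as np

def hessian_sensitivity(W, H_diag):
    # Per-parameter sensitivity proxy: diag(H) * w^2, assuming a diagonal
    # Hessian approximation. The dissertation's exact metric may differ.
    return H_diag * W ** 2

def assign_row_bits(W, H_diag, low_bits=4, high_bits=8, high_frac=0.25):
    # Aggregate parameter-wise sensitivity per row, then give the most
    # sensitive fraction of rows the higher precision (illustrative policy).
    row_sens = hessian_sensitivity(W, H_diag).sum(axis=1)
    threshold = np.quantile(row_sens, 1.0 - high_frac)
    return np.where(row_sens >= threshold, high_bits, low_bits)

def split_two_blocks(W, row_bits):
    # Regroup rows so the mixed-precision matrix becomes two
    # uniform-precision blocks; keep the permutation to undo it later.
    order = np.argsort(row_bits)              # low-bit rows first
    hi_mask = row_bits[order] == row_bits.max()
    return W[order][~hi_mask], W[order][hi_mask], order

def quantize_uniform(W, bits):
    # Simple symmetric uniform quantizer, one scale per block.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax if W.size else 1.0
    return np.round(W / scale).astype(np.int32), scale

# Toy usage on random data (stand-in weights and Hessian diagonal).
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
H_diag = rng.uniform(0.1, 1.0, size=W.shape)
row_bits = assign_row_bits(W, H_diag)
low_block, high_block, order = split_two_blocks(W, row_bits)
q_low, s_low = quantize_uniform(low_block, 4)
q_high, s_high = quantize_uniform(high_block, 8)
print(low_block.shape, high_block.shape)
```

Grouping rows into exactly two uniform-precision blocks, as the abstract notes, lets each block be processed by a single fixed-precision compute path, which is what makes the scheme attractive for an FPGA datapath.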
Date
2025-07-08
Resource Type
Text
Resource Subtype
Dissertation