Comparing the Performance of Small Word-Size Floating-Point Numerics to Fixed-Point Numerics in Neural Networks

Author(s)
Vennapusa, Lakshmi Grishma
Associated Organization(s)
Organizational Unit
Daniel Guggenheim School of Aerospace Engineering
The Daniel Guggenheim School of Aeronautics was established in 1931 and renamed the School of Aerospace Engineering in 1962.
Abstract
As neural network architectures grow in depth and complexity, training them efficiently under hardware constraints has become increasingly important. While fixed-point arithmetic offers resource advantages, it suffers from limited dynamic range and quantization inflexibility. This thesis introduces an alternative approach—Adaptive Precision Training (APT)—which leverages reduced-precision floating-point formats (FP8, FP12, FP16) for dynamic, layer-wise quantization during training. APT monitors per-layer Quantization Error Measurement (QEM) to guide precision adjustments and incorporates a novel bit-shuffling mechanism to reallocate bits between exponent and mantissa fields before escalating to higher-precision formats. This fine-grained control enables minimal precision escalation while preserving numerical fidelity. The APT framework is implemented in software and evaluated using an AlexNet-style model on the SVHN dataset. The experiments compare three training configurations: a fixed-point baseline, adaptive floating-point quantization, and adaptive quantization with bit-shuffling. Results show that the APT models achieve higher validation accuracy and smoother convergence, while maintaining most training in FP8 and FP12. Although memory usage increases due to dynamic quantization emulation, training time per epoch remains competitive. This work demonstrates that dynamic floating-point quantization—augmented with intra-format bit reallocation—offers a scalable and efficient alternative to fixed-point training, particularly for hardware-aware deep learning applications.
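The thesis itself is not included in this record, but the mechanism the abstract describes can be sketched in a few lines: emulate a reduced-precision float format parameterized by exponent and mantissa widths, measure per-layer quantization error (QEM), and try every exponent/mantissa split at the current word size (bit-shuffling) before escalating to a wider format. This is a minimal illustrative sketch, not the APT implementation; the function names, the error metric, and the 0.05 escalation threshold are assumptions for illustration.

```python
import math
import random

def quantize(x, exp_bits, man_bits):
    """Round x to the nearest value representable with the given exponent
    and mantissa widths (sketch: ignores subnormals and special values)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    m, e = math.frexp(x)                      # x = m * 2**e, 0.5 <= |m| < 1
    e -= 1                                    # normalize to 1 <= |significand| < 2
    e = max(-bias + 1, min(e, bias))          # clamp to representable exponent range
    step = 2.0 ** (e - man_bits)              # spacing of representable values at e
    q = round(x / step) * step                # round mantissa to man_bits bits
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    return max(-max_val, min(q, max_val))     # saturate on overflow

def qem(xs, exp_bits, man_bits):
    """Quantization Error Measurement (assumed here to be mean relative error)."""
    errs = [abs(quantize(x, exp_bits, man_bits) - x) / (abs(x) + 1e-12) for x in xs]
    return sum(errs) / len(errs)

def shuffle_then_escalate(xs, threshold=0.05):
    """APT-style precision choice (sketch): at each word size, try every
    exponent/mantissa split (bit-shuffling) before moving to a wider format."""
    for width in (8, 12, 16):                 # escalation path: FP8 -> FP12 -> FP16
        splits = [(e, width - 1 - e) for e in range(2, width - 2)]  # 1 sign bit
        err, (e, m) = min((qem(xs, e, m), (e, m)) for e, m in splits)
        if err < threshold:
            return width, e, m                # stay at the narrowest adequate format
    return 16, 5, 10                          # fall back to an FP16-like split
```

A layer's weights would be passed to `shuffle_then_escalate` periodically during training; only when no split at the current width keeps QEM below the threshold does the layer move to the next wider format.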
Date
2025-05-28
Resource Type
Text
Resource Subtype
Thesis