Algorithm–Hardware Co-Design Of Digital Compute-In-Memory Architecture Supporting Flexible And Temporal N:M Sparsity
Author(s)
Ramachandran, Akshat
Abstract
Foundation models continue to grow in size and capability, increasing their demands on memory, energy, and computation. This has motivated algorithm–hardware co-design approaches for efficient inference, with structured N:M sparsity emerging as a promising way to reduce memory and compute overhead. However, applying a single fixed sparsity configuration across all layers and decoding steps can limit model expressivity and degrade accuracy, while supporting multiple configurations introduces challenges in both pattern selection and hardware efficiency. This thesis addresses these challenges through a co-design approach that enables flexible and dynamic N:M sparsity. At the algorithm level, it introduces FLOW, a layer-wise framework that selects sparsity configurations based on the magnitude and distribution of outliers, and FLOW++, which extends this approach to the temporal domain by adapting sparsity across decoding steps. At the hardware level, it presents FlexCiM, a low-overhead digital compute-in-memory architecture designed to efficiently support models with varying sparsity patterns. Together, these contributions demonstrate that flexible N:M sparsity, when co-designed with hardware, can effectively balance model expressivity and computational efficiency.
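For readers unfamiliar with the term, the sketch below illustrates what a structured N:M pattern looks like, under the common convention that at most N of every M consecutive weights are kept. It uses plain magnitude-based pruning on a flat list of weights. This is an assumption chosen for illustration only: it is not the FLOW or FLOW++ selection method described in the abstract, and the function name `prune_n_m` is hypothetical.

```python
def prune_n_m(weights, n, m):
    """Keep the n largest-magnitude values in each consecutive group of m.

    Illustrative magnitude-based N:M pruning only; not the thesis's
    FLOW/FLOW++ method, which selects the (N, M) configuration itself.
    """
    if m <= 0 or not 0 <= n <= m:
        raise ValueError("require m > 0 and 0 <= n <= m")
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")

    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Rank positions by descending magnitude; sorted() is stable,
        # so ties are broken in favour of the lower index.
        ranked = sorted(range(m), key=lambda i: -abs(group[i]))
        keep = set(ranked[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
print(prune_n_m(row, 2, 4))  # 2:4 pattern: 50% of weights kept
print(prune_n_m(row, 1, 4))  # 1:4 pattern: 25% of weights kept
```

Calling the same function with different (N, M) pairs, as in the last two lines, mimics the idea of using different sparsity configurations for different layers or decoding steps; how those pairs are chosen is the subject of the thesis and is not modelled here.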
Date
2026-05
Resource Type
Text
Resource Subtype
Thesis (Masters Degree)