Neighborhood Attention: Fast and Flexible Sparse Attention

Author(s)
Hassani, Ali
Associated Organization(s)
School of Interactive Computing
Abstract
Attention is at the heart of most foundational AI models, across tasks and modalities. In many of those cases it incurs significant computation: its cost grows quadratically with the number of tokens, which is often cited as one of its greatest limitations. As a result, many sparse approaches have been proposed to alleviate this issue, one of the most common being masked or reduced attention spans. In this work, we revisit sliding window approaches, which were commonly believed to be inherently inefficient, and propose a new framework called Neighborhood Attention (NA). Through it, we resolve design flaws in the original sliding window attention works, attempt to implement the approach efficiently for modern hardware accelerators, specifically GPUs, and conduct experiments that highlight the strengths and weaknesses of these approaches. At the same time, we bridge the parameterization and properties of convolution and attention by showing that NA exhibits inductive biases and receptive fields similar to those of convolutions, while remaining capable of capturing both short- and long-range interdependencies, similar to attention. We then show the necessity of, and the challenges that arise from, supporting infrastructure, especially in the context of modern implementations such as Flash Attention, and develop even more efficient, performance-optimized implementations of NA for the most recent and popular AI hardware accelerators, the NVIDIA Hopper and Blackwell GPUs. We build models based on the NA family, highlighting their superior quality and efficiency compared to existing approaches, and also plug NA into existing foundational models, showing that it can accelerate them by up to 1.6× end-to-end without further training, and by up to 2.6× end-to-end with training. We further demonstrate that our methodology can create sparse attention patterns that realize the theoretical limit of their speedups. This work is open-sourced through the NATTEN project at natten.org.
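To make the mechanism the abstract describes concrete: each query attends only to a fixed-size window of its nearest neighbors, so cost grows linearly with sequence length rather than quadratically. The NumPy sketch below is a minimal 1D illustration written for this record, not the NATTEN implementation (NATTEN provides fused GPU kernels); the function name and the boundary handling, where windows near the sequence edges are shifted inward so every query attends to exactly `window` tokens, are assumptions drawn from the published descriptions of NA.

```python
import numpy as np

def neighborhood_attention_1d(q, k, v, window):
    """Naive 1D neighborhood attention (illustrative sketch, not NATTEN).

    Each of the n queries attends only to the `window` keys nearest to it,
    giving O(n * window) cost instead of the O(n^2) of full attention.
    `window` is assumed odd and no larger than the sequence length.
    """
    n, d = q.shape
    assert window % 2 == 1 and window <= n
    r = window // 2
    out = np.empty_like(v)
    for i in range(n):
        # Clamp the neighborhood to the sequence bounds while keeping its
        # size fixed: queries near an edge shift their window inward rather
        # than shrinking it (a key difference from plain sliding windows).
        start = min(max(i - r, 0), n - window)
        nbr = slice(start, start + window)
        # Scaled dot-product attention restricted to the neighborhood.
        scores = q[i] @ k[nbr].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[i] = weights @ v[nbr]
    return out

# Tiny usage example.
rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(neighborhood_attention_1d(q, k, v, window=5).shape)  # (16, 8)
```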
Date
2026-05
Resource Type
Text
Resource Subtype
Dissertation (PhD)