Less is More: Accelerating Vision by Eliminating Redundancy

Author(s)
Bolya, Daniel
Organizational Unit
School of Interactive Computing
Abstract
The key to modern machine learning is scale. With more data, bigger models, and more compute, problems once deemed impossible for a computer to solve have rapidly become attainable, and many have even become easy with today's techniques. But as the scale of modern machine learning has ballooned, so too has its cost. Large transformer models, for instance, can require hundreds of GPUs to train effectively and can be similarly unwieldy to deploy. In this dissertation, I aim to reduce those costs. Specifically, this work focuses on Vision Transformers (ViTs), which have been the dominant driving force in scaling machine learning for computer vision. Over the course of this dissertation, I show that ViTs perform redundant computation, and that by exploiting these redundancies we can greatly increase the efficiency of these systems, both during training and inference. In Part 1, I show that we can reduce the amount of spatial computation these transformers require without losing performance. In Part 2, I show that certain architectural components are redundant and can be removed or greatly simplified. In Part 3, I show how we can exploit redundant features within models to speed them up and to circumvent training. Finally, in Part 4, I show that these speed-ups compound on each other, resulting in a much faster model.
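As a rough illustration of the theme in Part 1 (cutting spatial computation by exploiting token redundancy), the sketch below shrinks a ViT token sequence by averaging the most similar adjacent token pairs. The function name, the greedy adjacent-pair rule, and the averaging merge are all assumptions made for this example; they are not the dissertation's actual algorithm.

```python
import numpy as np

def merge_similar_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Shrink a (n_tokens, dim) token sequence by averaging the r most
    similar adjacent token pairs. Purely illustrative: the pairing and
    merge rules here are assumptions, not the dissertation's method."""
    n, _ = tokens.shape
    # Cosine similarity between each token and its right-hand neighbour.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sims = np.einsum("id,id->i", normed[:-1], normed[1:])  # shape (n - 1,)

    used = set()                   # token indices already inside a merged pair
    keep = np.ones(n, dtype=bool)  # which rows survive the merge
    out = tokens.copy()
    pairs_merged = 0
    for i in np.argsort(-sims):    # consider the most similar pairs first
        if pairs_merged == r:
            break
        if i in used or i + 1 in used:
            continue               # would overlap an already-merged pair
        out[i] = (tokens[i] + tokens[i + 1]) / 2.0  # merge by averaging
        keep[i + 1] = False
        used.update((i, i + 1))
        pairs_merged += 1
    return out[keep]               # n - r tokens remain
```

Each merge removes one token from the sequence, so the cost of self-attention, which is quadratic in sequence length, drops accordingly.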
Date
2024-04-24
Resource Type
Text
Resource Subtype
Dissertation