Title:
Fast and compact neural network via Tensor-Train reparameterization

Author(s)
Yin, Chunxing
Advisor(s)
Vuduc, Richard W.
Abstract
The exponential growth of data and model sizes poses a number of challenges for deep learning training. Large neural network layers can be reparameterized via tensor decomposition to compress the model, but at the potential cost of degraded accuracy and additional execution time to reconstruct the layer parameters from the tensorized representation. In this dissertation, we explore neural network compression through Tensor-Train (TT) reparameterization. We aim to develop efficient algorithms that accelerate the training of tensorized networks while minimizing memory consumption, and to understand which components are necessary for the Tensor-Train format to succeed at model compression. We design efficient algorithms to accelerate the training of tensorized layers in Convolutional Neural Networks (CNNs), Deep Learning Recommendation Models (DLRMs), and Graph Neural Networks (GNNs).

While the use of TT for compression in CNNs has been suggested in the past, prior work has not demonstrated significant speedups for training or inference. The reason is that conventional implementations of TT-compressed convolutional layers pose several challenges: increased computational work to reconstruct TT-compressed layers, an increased memory footprint due to weight reconstruction, and limited parallel scalability as the effective problem sizes shrink under compression. We address these issues through asymptotic reductions in computation, avoidance of data movement, and an alternative parallelization strategy that significantly improves scalability.

In recommendation models, the performance of TT-compressed DLRM (TT-Rec) is further optimized with batched matrix multiplication and caching strategies for embedding-vector lookups. In addition, we show mathematically and empirically the effect of the weight-initialization distribution on DLRM accuracy and propose initializing the tensor cores of TT-Rec from a sampled Gaussian distribution.

In the next part of this dissertation, we study node embeddings in graph neural networks, where both numerical features and topological graph information must be preserved. We design training schemes that unify hierarchical tensor decomposition with graph topology to exploit graph homophily, and we develop novel parameter-initialization algorithms that incorporate the graph spectrum to improve model convergence and accuracy. Finally, we evaluate our technique on million-node graphs to demonstrate its efficiency and accuracy on real-world graphs, as well as on synthetic graphs to understand the correlation between graph homophily and weight sharing in TT.

While the primary focus of this dissertation lies in exploring proof-of-concept algorithms, its outcomes can hold significant implications for systems. For example, by transforming the data-intensive embedding operator into a compute-intensive, memory-efficient tensorized embedding, we can potentially reconfigure the allocation of system resources within a heterogeneous data center that combines CPUs and GPUs. Moreover, our compression technique would enable storing large modules on a limited-memory accelerator with data parallelism, thereby providing opportunities for optimizing communication.
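To make the core idea concrete, the sketch below shows a Tensor-Train reparameterized embedding lookup in the spirit of TT-Rec, where the full table is never materialized and each requested row is contracted from small TT cores on the fly. This is an illustrative NumPy sketch, not the dissertation's implementation: the class name, the chosen factorizations of the row and column dimensions, the rank, and the 0.1 initialization scale are all assumptions made for the example.

```python
# Minimal sketch of a TT-reparameterized embedding table (illustrative only).
# An implicit table W of shape (N, D), with N = n1*n2*n3 and D = d1*d2*d3,
# is stored as three TT cores G_k of shape (r_{k-1}, n_k, d_k, r_k).
import numpy as np


class TTEmbedding:
    def __init__(self, n_dims=(200, 220, 250), d_dims=(4, 4, 8), rank=16, seed=0):
        rng = np.random.default_rng(seed)
        ranks = (1, rank, rank, 1)  # boundary TT ranks r_0 = r_3 = 1
        self.n_dims, self.d_dims = n_dims, d_dims
        # Gaussian core initialization. The dissertation derives a specific
        # sampled-Gaussian scheme so the *reconstructed* rows follow a target
        # distribution; the 0.1 standard deviation here is only a placeholder.
        self.cores = [
            rng.normal(0.0, 0.1, size=(ranks[k], n_dims[k], d_dims[k], ranks[k + 1]))
            for k in range(3)
        ]

    def _digits(self, idx):
        """Mixed-radix decomposition of a row index into (i1, i2, i3)."""
        n1, n2, n3 = self.n_dims
        return idx // (n2 * n3), (idx // n3) % n2, idx % n3

    def lookup(self, idx):
        """Reconstruct row `idx` of the implicit (N, D) table from the TT cores."""
        i1, i2, i3 = self._digits(idx)
        a = self.cores[0][0, i1]        # (d1, r1)
        b = self.cores[1][:, i2]        # (r1, d2, r2)
        c = self.cores[2][:, i3, :, 0]  # (r2, d3)
        # Contract over the TT ranks: O(D * r^2) work per lookup, and only the
        # requested row is ever formed -- no full-table reconstruction.
        row = np.einsum("ar,rbs,sc->abc", a, b, c)
        return row.reshape(-1)          # length d1*d2*d3 = D


emb = TTEmbedding()
vec = emb.lookup(123456)
print(vec.shape)  # (128,): D = 4*4*8, while N = 200*220*250 = 11M rows are implicit
```

In practice, batching many lookups into grouped (batched) matrix multiplications over shared core slices, plus caching frequently accessed rows, is what recovers performance, as the abstract describes for TT-Rec; the per-row contraction above only illustrates the memory trade-off.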
Date Issued
2023-08-28
Resource Type
Text
Resource Subtype
Dissertation