Title
Quantitative convergence analysis of dynamical processes in machine learning
Author(s)
Wang, Yuqing
Advisor(s)
Tao, Molei
Abstract
This thesis focuses on the quantitative convergence analysis of several important machine learning processes from a dynamical perspective, in order to understand and guide machine learning practice. Machine learning is increasingly used across many fields. Typical machine learning models involve an optimization process and a generalization process, whose performance depends on the network architecture, algorithm, learning rate, batch size, training strategy, and other design choices. Many of these processes and designs rely, explicitly or implicitly, on (nonlinear) dynamics and can be understood theoretically through refined convergence analysis. More precisely:
The first part of this thesis illustrates the effect of a large learning rate on optimization dynamics, which often correlates with improved generalization. Specifically, we consider non-convex and non-Lipschitz-smooth objective functions arising in matrix factorization problems, minimized by gradient descent (GD) with large learning rates, a regime beyond the scope of classical optimization theory. We develop a new convergence analysis showing that a large learning rate biases GD towards flatter minima, at which the two factors of the matrix factorization objective are more balanced.

The second part extends the theory of the first part to a unified mechanism behind several implicit biases, including the edge of stability, balancing, and catapult phenomena. We broaden the earlier convergence analysis to a family of objective functions with varying regularity and show that good regularity combined with a large learning rate gives rise to these phenomena.

The third part concentrates on diffusion models, a concrete and important real-world application, and theoretically demonstrates how to choose their hyperparameters for good performance through a convergence analysis of the full generation process, including both optimization and sampling. Our theory turns out to be consistent with the practical usage that yields leading empirical results.

The fourth part studies the generalization performance of two architectures: deep residual networks (ResNets) and deep feedforward networks (FFNets). Viewing these architectures as iterated maps and analyzing their convergence via the neural tangent kernel, we prove that deep ResNets can effectively separate data, whereas deep FFNets degenerate and lose their learnability.
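As a toy illustration of the balancing effect described in the first part (not the thesis's analysis itself), the sketch below runs gradient descent on the scalar special case f(x, y) = ½(xy − 1)² of a matrix factorization objective at a small and a large learning rate, and reports the imbalance between the two factors. The initialization and learning rates are illustrative assumptions; values chosen too large make GD diverge.

```python
# Minimal sketch of the balancing effect, on the scalar factorization
# f(x, y) = 0.5 * (x*y - 1)^2.  Gradient flow conserves x^2 - y^2, so a
# small learning rate keeps the factors imbalanced, while a larger (but
# still convergent) one drives them toward balance.  All numbers are
# illustrative assumptions, not the thesis's setup.

def run_gd(lr, x=0.1, y=4.0, steps=2000):
    for _ in range(steps):
        r = x * y - 1.0                          # residual of the factorization
        x, y = x - lr * r * y, y - lr * r * x    # simultaneous GD step
    return x, y

for lr in (0.01, 0.15):                          # "small" vs. "large" learning rate
    x, y = run_gd(lr)
    print(f"lr={lr}: x={x:.3f}, y={y:.3f}, "
          f"imbalance |x^2 - y^2| = {abs(x*x - y*y):.2f}, "
          f"loss = {0.5 * (x*y - 1.0)**2:.2e}")
```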
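The third part's message, that hyperparameters such as the number of sampling steps can be guided by a convergence analysis of the generation process, can be illustrated with a toy sampler. The sketch below assumes a one-dimensional data distribution (two points at ±1) noised by an Ornstein-Uhlenbeck process, so the score of the noisy marginals is available in closed form, and integrates the probability-flow ODE with a plain Euler scheme. The noising process, time grid, and step counts are illustrative assumptions, not the samplers analyzed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: two points at +-1.  Forward noising (assumed): OU process with
# x_t | x_0 ~ N(x_0 * exp(-t), 1 - exp(-2t)), so the score is closed-form
# and no learned network is needed for this sketch.
def score(x, t):
    m, s2 = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    return -(x - m * np.tanh(m * x / s2)) / s2

def sample(n_steps, T=5.0, n=2000):
    """Probability-flow ODE sampler with a simple Euler discretization."""
    ts = np.linspace(T, 1e-3, n_steps + 1)       # backward time grid
    x = rng.standard_normal(n)                   # start from the prior N(0, 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0                             # negative step (backward in time)
        x = x + (-x - score(x, t0)) * dt         # Euler step of dx/dt = -x - score
    return x

for n_steps in (5, 20, 100):                     # discretization is a key hyperparameter
    x = sample(n_steps)
    print(f"{n_steps:4d} steps: mean |x| = {np.abs(x).mean():.3f} (target 1.0)")
```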
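For the fourth part, the intuition that a deep feedforward stack degenerates while a residual stack keeps distinct inputs separated can already be seen in a forward-pass sketch of the two architectures viewed as iterated maps; the thesis's actual argument is a neural tangent kernel convergence analysis, which this does not reproduce. The width, depth, 1/sqrt(L) residual scaling, and He-style initialization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, L = 64, 200                                                     # width and depth (illustrative)
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]  # shared random weights
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def forward(x, residual):
    """Propagate x through L ReLU layers, viewed as an iterated map."""
    h = x / np.linalg.norm(x)
    for W in Ws:
        f = np.sqrt(2.0) * np.maximum(W @ h, 0.0)    # He-scaled ReLU layer
        h = h + f / np.sqrt(L) if residual else f    # residual vs. plain feedforward update
    return h

# The plain stack tends to collapse distinct inputs (cosine near 1),
# while the scaled residual stack tends to keep them separated.
print("input cosine:", round(cos(x1, x2), 3))
for residual, name in [(False, "deep FFNet"), (True, "deep ResNet")]:
    c = cos(forward(x1, residual), forward(x2, residual))
    print(f"{name}: cosine between hidden representations = {c:.3f}")
```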
Date Issued
2024-07-27
Resource Type
Text
Resource Subtype
Dissertation