Improving Foundation Models
Author(s)
Komatsuzaki, Aran
Abstract
Foundation models are the family of models (e.g., GPT-4, CLIP) that are trained on a massive dataset and perform a wide variety of downstream tasks, usually with zero- or few-shot learning, optionally after fine-tuning. This dissertation presents a range of contributions we have made toward making foundation models more efficient, performant, and versatile. In particular, we focus on three axes of improvement: architecture, dataset, and training. We first present our findings on how to optimally scale language models, which lead to significant performance improvements. We then present GPT-J, one of the earliest open-source large language models. We next show that the performance of ViT and T5, both Transformer-based foundation models, can be greatly improved for a given compute budget using Sparse Upcycling, which resumes training of a sparsely gated model initialized from pretrained dense models. We also briefly discuss the LAION datasets, massive open-source datasets of roughly one billion text-image pairs that are used to train various state-of-the-art multimodal models, and the ARB benchmark, a highly challenging benchmark for evaluating state-of-the-art LLMs such as GPT-4. On the theoretical side, we prove that the feedforward layers of a Transformer cannot be compressed without information loss, which may explain the power of sparsely gated models such as mixture-of-experts.
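As a rough, self-contained sketch of the Sparse Upcycling idea summarized above (this is not code from the dissertation; all names, shapes, and hyperparameters are illustrative assumptions), the following Python/NumPy listing copies a pretrained dense feedforward block into every expert of a sparsely gated mixture-of-experts layer and attaches a freshly initialized router before training resumes:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, num_experts = 512, 2048, 8

    # Hypothetical pretrained dense FFN weights (stand-ins for a real checkpoint).
    dense_w_in = rng.normal(0, 0.02, size=(d_model, d_ff))
    dense_w_out = rng.normal(0, 0.02, size=(d_ff, d_model))

    # Sparse upcycling: every expert starts as an exact copy of the dense FFN,
    # so the upcycled layer initially computes the same function as the dense one.
    expert_w_in = np.stack([dense_w_in.copy() for _ in range(num_experts)])
    expert_w_out = np.stack([dense_w_out.copy() for _ in range(num_experts)])

    # The router is new and randomly initialized; it learns to dispatch tokens
    # to experts during continued (upcycled) training.
    router_w = rng.normal(0, 0.02, size=(d_model, num_experts))

    def moe_ffn(x, top_k=2):
        """Top-k sparsely gated FFN for token vectors x of shape (n_tokens, d_model)."""
        logits = x @ router_w                        # (n_tokens, num_experts)
        top = np.argsort(-logits, axis=-1)[:, :top_k]
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            # Softmax over the selected experts' logits only.
            sel = logits[t, top[t]]
            gates = np.exp(sel - sel.max())
            gates /= gates.sum()
            for g, e in zip(gates, top[t]):
                h = np.maximum(x[t] @ expert_w_in[e], 0.0)   # ReLU feedforward
                out[t] += g * (h @ expert_w_out[e])
        return out

    tokens = rng.normal(size=(4, d_model))
    print(moe_ffn(tokens).shape)   # (4, 512)

Because each expert begins as an exact copy of the dense block, the upcycled layer initially reproduces the dense model's computation, and continued training lets the experts specialize.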
Date
2023-12-10
Resource Type
Text
Resource Subtype
Dissertation