AutoCurate: Automating Domain-Specific Dataset Curation for Large Language Models

Author(s)
Gupta, Anant
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance on open-domain tasks, yet they often struggle with factual accuracy and terminology in specialized domains such as finance, law, or medicine. Existing approaches to building domain-specific LLMs typically rely on fully curated in-domain corpora, which are expensive and labor-intensive to assemble at scale. In this work, we propose a novel, scalable pipeline for domain-specific dataset curation that minimizes manual intervention. Our method combines topic modeling and BM25-based filtering to iteratively expand a small seed corpus into a high-quality, domain-relevant dataset drawn from large-scale generic corpora. We demonstrate the effectiveness of our approach by curating a financial dataset exceeding 100 billion tokens from the Dolma corpus. We will publicly release the datasets and an industrial-grade implementation of our pipeline to facilitate its application across other domains. Overall, our work presents a practical and extensible solution for building high-quality domain-specific training corpora, advancing the development of reliable, specialized LLMs.
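The iterative expansion described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, parameters, stopword list, and threshold are all hypothetical, and a simple term-frequency count stands in for the topic-modeling step. The sketch only shows the general shape of the loop: derive query terms from the seed, score a generic pool with Okapi BM25, move high-scoring documents into the seed, and repeat.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the thesis).
STOPWORDS = {"the", "as", "and", "a", "on", "for", "with", "my", "all", "of", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Return the Okapi BM25 score of each tokenized document for the query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n if n else 0.0
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency: count each term once per doc
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_terms(docs, k):
    """Most frequent non-stopword terms; a crude stand-in for topic modeling."""
    counts = Counter(t for d in docs for t in d if t not in STOPWORDS)
    return [t for t, _ in counts.most_common(k)]


def expand_seed(seed_texts, pool_texts, rounds=2, k_terms=5, threshold=1.0):
    """Iteratively move pool documents that score >= threshold into the seed."""
    seed = [tokenize(t) for t in seed_texts]
    pool = [(t, tokenize(t)) for t in pool_texts]
    selected = []
    for _ in range(rounds):
        if not pool:
            break
        query = top_terms(seed, k_terms)
        scores = bm25_scores(query, [toks for _, toks in pool])
        keep = {i for i, s in enumerate(scores) if s >= threshold}
        if not keep:
            break  # nothing new was found, so further rounds cannot help
        for i in sorted(keep):
            selected.append(pool[i][0])
            seed.append(pool[i][1])
        pool = [p for i, p in enumerate(pool) if i not in keep]
    return selected


if __name__ == "__main__":
    seed = [
        "the bond yield rose as the bank raised interest rates",
        "stock market investors watch interest rates and bond prices",
    ]
    pool = [
        "the bank cut interest rates and bond prices rallied",
        "my cat sleeps on the warm sofa all day",
        "a recipe for tomato soup with fresh basil",
    ]
    for doc in expand_seed(seed, pool):
        print(doc)
```

At corpus scale, the in-memory scoring here would presumably be replaced by an inverted index, and the fixed threshold by a tuned cutoff; both are left out to keep the sketch short.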
Resource Type
Text
Resource Subtype
Undergraduate Research Option Thesis