Cleaning and Learning Over Dirty Tabular Data
Author(s)
Li, Peng
Advisor(s)
Rong, Kexin
Abstract
Machine learning (ML) applications are only as good as the data they are trained on. Unfortunately, real-world data is rarely free of errors; tabular data in particular frequently suffers from issues such as missing values, outliers, and inconsistencies. Data cleaning is therefore widely regarded as an essential step in an ML workflow and an effective way to improve ML performance. However, data cleaning is often a time-consuming and expensive process that relies heavily on human effort. Traditional approaches often treat data cleaning as a standalone task, independent of its downstream applications, which may not effectively improve ML performance and can sometimes even worsen it. Furthermore, this standalone treatment often incurs unnecessary costs for cleaning errors that have only a minor impact on ML performance.
This dissertation considers data cleaning and machine learning jointly and focuses on developing algorithms and systems for cleaning and learning over dirty tabular data, with the dual objectives of (1) optimizing downstream ML performance and (2) minimizing human effort. We start with CleanML, an empirical study that systematically evaluates the impact of data cleaning on downstream ML performance. We then introduce CPClean, a cost-effective, human-involved data cleaning algorithm for ML that minimizes human cleaning effort while preserving ML performance. Next, we present DiffPrep, an automatic data preprocessing method that efficiently selects data preprocessing (cleaning) pipelines to maximize downstream ML performance without human involvement. Finally, we present Auto-Tables, which automatically transforms tables from non-standard formats into a standard format without any human effort. Together, the works in this dissertation can be integrated into a comprehensive system for cleaning and learning over dirty tabular data.
Date
2023-12-11
Resource Type
Text
Resource Subtype
Dissertation