Title
Data-Centric Bias Mitigation in Machine Learning Life Cycle

Author(s)
Zhang, Hantian
Advisor(s)
Chu, Xu
Rong, Kexin
Associated Organization(s)
School of Computer Science
Abstract
As machine learning (ML) becomes increasingly central to decision-making in our society, it is crucial to acknowledge that ML models can inadvertently perpetuate biases, disproportionately harming certain demographic groups and individuals. For instance, ML models used in judicial systems have exhibited bias against African Americans when predicting recidivism rates. Addressing these inherent biases and ensuring fairness in ML models is therefore imperative. While fairness can be improved by modifying the ML models directly, we argue that a more foundational solution lies in correcting the data, because biased data is often the root cause of unfairness. In this dissertation, we aim to systematically understand and mitigate bias in ML models across the full ML life cycle, from data preparation (pre-processing), to model training (in-processing), to model validation (post-processing).

First, we develop a pioneering system, iFlipper, that optimizes for individual fairness in ML. iFlipper improves training data during data preparation by adjusting labels, mitigating the inconsistencies that arise when similar individuals receive different outcomes. Experiments on real datasets show that iFlipper significantly outperforms other pre-processing baselines in both individual fairness and accuracy on unseen test sets.

Next, we introduce OmniFair, a declarative system for improving group fairness in ML. OmniFair allows users to specify group fairness constraints and adjusts the weight of each training sample during training to satisfy those constraints. We show that OmniFair is more versatile than existing algorithmic fairness approaches in terms of both the fairness constraints and the downstream ML models it supports, and that it reduces accuracy loss by up to 94.8% compared with the second-best method.

Finally, we present a method for discovering and explaining semantically coherent subsets (slices) of unstructured data on which trained ML models underperform. Specifically, we introduce a new perspective for quantifying the explainability of unstructured data slices by borrowing the concept of separability from the machine learning literature. We find that separability, which captures how well a slice can be differentiated from the rest of the dataset, complements the coherence measure, which focuses on the commonalities of all instances within a slice. Preliminary results demonstrate that a separability-based slice discovery algorithm identifies data slices complementary to those found by existing, coherence-based approaches.

Together, the three works in this dissertation can be integrated into a comprehensive system that reduces bias in data across the full machine learning life cycle, covering different fairness metrics and different types of data: iFlipper handles structured data and individual fairness in the data preparation step, OmniFair handles structured data and group fairness in the model training step, and slice discovery handles unstructured data in the model validation step.
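
To make the label-adjustment idea concrete, the following is a minimal sketch of flipping binary labels so that similar individuals receive the same outcome. It uses a simple greedy heuristic, and the names (similar_pairs, flip_labels_greedy) are illustrative; this is not iFlipper's actual optimization algorithm, only an illustration of the kind of inconsistency it removes.

# Illustrative sketch of label flipping for individual fairness
# (a greedy heuristic, NOT iFlipper's actual optimization algorithm).
# A "violation" is a pair of similar individuals with different labels.

def count_violations(labels, similar_pairs):
    return sum(1 for i, j in similar_pairs if labels[i] != labels[j])

def flip_labels_greedy(labels, similar_pairs):
    labels = list(labels)
    while True:
        current = count_violations(labels, similar_pairs)
        best_gain, best_idx = 0, None
        for idx in set(i for pair in similar_pairs for i in pair):
            labels[idx] ^= 1                      # tentatively flip label idx
            gain = current - count_violations(labels, similar_pairs)
            labels[idx] ^= 1                      # undo the tentative flip
            if gain > best_gain:
                best_gain, best_idx = gain, idx
        if best_idx is None:                      # no flip reduces violations
            return labels
        labels[best_idx] ^= 1                     # commit the best flip

# Toy example: individuals 0/1 and 2/3 are similar but labeled differently.
labels = [1, 0, 0, 1]
similar_pairs = [(0, 1), (2, 3)]
print(flip_labels_greedy(labels, similar_pairs))  # each similar pair now shares a label

The greedy loop terminates because every committed flip strictly reduces the number of violated pairs; minimizing the number of flips, as iFlipper aims to do, requires the optimization machinery developed in the dissertation.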
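The sample-reweighting idea can likewise be illustrated with a classic reweighing heuristic (in the spirit of Kamiran and Calders) that weights each sample so that group membership and label become statistically independent, and then checks a demographic parity gap. The sketch assumes a scikit-learn-style estimator that accepts sample_weight; the function names and the reweighing scheme are stand-ins for exposition, not OmniFair's interface or algorithm.

# Illustrative sketch: sample reweighting toward a group fairness constraint
# (a classic reweighing heuristic used as a stand-in; NOT OmniFair's algorithm).
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweigh(groups, y):
    """Weight each sample by P(group) * P(label) / P(group, label) so that
    group membership and label become independent in the weighted data."""
    weights = np.empty(len(y), dtype=float)
    for g in np.unique(groups):
        for label in np.unique(y):
            mask = (groups == g) & (y == label)
            expected = (groups == g).mean() * (y == label).mean()
            observed = mask.mean()
            weights[mask] = expected / observed if observed > 0 else 0.0
    return weights

def demographic_parity_diff(y_pred, groups):
    """Gap in positive-prediction rates between the two groups."""
    return abs(y_pred[groups == 0].mean() - y_pred[groups == 1].mean())

# Toy usage: train with and without the fairness-oriented weights.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, 1000)
X = rng.normal(size=(1000, 3)) + groups[:, None] * 0.5
y = (X[:, 0] + 0.8 * groups + rng.normal(size=1000) > 0.7).astype(int)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=reweigh(groups, y))
print(demographic_parity_diff(plain.predict(X), groups))
print(demographic_parity_diff(weighted.predict(X), groups))

Such static reweighing typically narrows, but does not guarantee, the demographic parity gap; the point of a declarative system like OmniFair is that the constraint, the fairness metric, and the downstream model are all user-specified rather than fixed as in this sketch.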
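The separability notion can also be illustrated with a simple proxy: score a candidate slice by how accurately a linear probe can distinguish slice members from the rest of the dataset using instance embeddings. The function below (separability_score) is my illustrative proxy for the concept, not the slice discovery algorithm described above.

# Illustrative sketch: scoring how "separable" a candidate data slice is,
# i.e., how well slice members can be told apart from the rest of the data
# (a proxy for the concept, NOT the dissertation's discovery algorithm).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def separability_score(embeddings, slice_mask, folds=5):
    """Cross-validated AUC of a linear probe that predicts slice membership
    from instance embeddings; ~0.5 means inseparable, ~1.0 means highly separable."""
    probe = LogisticRegression(max_iter=1000)
    scores = cross_val_score(probe, embeddings, slice_mask.astype(int),
                             cv=folds, scoring="roc_auc")
    return scores.mean()

# Toy usage: a slice with a shared signature in embedding space scores higher.
rng = np.random.default_rng(1)
embeddings = rng.normal(size=(500, 16))
separable_slice = np.zeros(500, dtype=bool)
separable_slice[:100] = True
embeddings[separable_slice] += 1.5           # give the slice a distinct signature

random_slice = rng.random(500) < 0.2         # arbitrary slice, no shared signature
print(separability_score(embeddings, separable_slice))  # close to 1.0
print(separability_score(embeddings, random_slice))     # close to 0.5

A high score under this proxy means the slice is easy to tell apart from the remaining data, which is exactly the property the separability perspective adds on top of coherence-based slice discovery.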
Date Issued
2024-07-26
Resource Type
Text
Resource Subtype
Dissertation