Robustness To Visual Perturbations In Pixel-Based Tasks

Yung, Dylan D.
Kira, Zsolt
Hoffman, Judy
Convolutional Neural Networks (CNNs) have been shown to provide great utility across many vision tasks and have become the go-to model for problems involving video or image input. Though they have shown promise across many problems, they come with inherent flaws. For example, in image classification, CNNs are known to output very high confidence values even when their accuracy is low. This is exacerbated when visual perturbations are introduced to inputs, causing accuracy to drop while confidence remains high. The problem is similarly acute when models use visual inputs for decision-making, such as in pixel-based Reinforcement Learning (RL), where an agent must learn a policy using images of the environment as input. RL agents in these settings can perform well in training, but once deployed may face unseen visual perturbations, causing erroneous execution of their learned task. Such poor robustness can be deadly in applied Machine Learning (ML) settings such as medicine and autonomous vehicles. Thus, ways to impart robustness on CNNs for image classification and RL are of utmost importance.

In this thesis, we explore solutions to the problem of overconfident image classification models and of embedding robustness to visual perturbations in RL. We propose two distinct frameworks, one for each context: Geometric Sensitivity Decomposition (GSD) for image-based classification and Augmentation Curriculum Learning (AugCL) for decision-making.

CNNs used for image classification have been shown to be erroneously overconfident. Much of this overconfidence is attributed to the combination of Cross-Entropy loss, the standard loss for classification, and the final linear layer typical of vision models. GSD decomposes the norm of a sample's feature embedding and its angular similarity to a target classifier into an instance-dependent and an instance-independent component. The instance-dependent component captures sensitive information about changes in the input, while the instance-independent component represents insensitive information serving solely to minimize the loss on the training dataset. Inspired by this decomposition, we analytically derive a simple extension to current softmax-linear models that learns to disentangle the two components during training. On several common vision models, the disentangled model outperforms other calibration methods on standard calibration metrics in the face of out-of-distribution (OOD) data and corruption, with significantly less complexity. Specifically, we surpass the current state of the art by a 30.8% relative improvement in Expected Calibration Error on corrupted CIFAR100.

Pixel-based RL agents often fail to identify and learn relevant visual features when attributes such as color are changed. Image augmentation has been shown to help with this, but its strength is difficult to balance. AugCL is a novel curriculum learning approach that schedules image augmentation by splitting training into a weak augmentation phase and a strong augmentation phase. We also introduce a novel visual augmentation strategy that improves performance on the benchmarks we evaluate. Our method achieves state-of-the-art performance on the DeepMind Control Generalization Benchmark when combined with previous methods.
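The decomposition GSD builds on starts from a standard identity of the linear head: each logit factors into the feature norm, the class-weight norm, and the cosine of the angle between them. A minimal NumPy sketch of that factorization follows; it shows only the norm/angle split the abstract describes, not the thesis's learned disentangling layer, and the function and variable names are illustrative.

```python
import numpy as np

def logit_decomposition(feature, class_weights):
    """Factor standard linear logits into norm and angular parts.

    For a linear head, logit_k = w_k . f = ||w_k|| * ||f|| * cos(theta_k).
    The feature norm ||f|| and the cosines carry the instance-dependent,
    input-sensitive information; how GSD learns to separate the
    instance-independent component during training is not reproduced here.
    Assumes nonzero feature and weight vectors.
    """
    f_norm = np.linalg.norm(feature)                          # instance-dependent magnitude
    w_norms = np.linalg.norm(class_weights, axis=1)           # per-class weight norms
    cosines = (class_weights @ feature) / (w_norms * f_norm)  # angular similarity
    logits = w_norms * f_norm * cosines                       # recovers class_weights @ feature
    return logits, f_norm, cosines
```

Because the factorization is exact, the recomposed logits match the ordinary matrix-vector product, which is what makes the extension to existing softmax-linear models simple.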
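The two-phase curriculum AugCL describes can be sketched as a simple step-indexed switch between augmentations; the function signature and the `switch_step` parameter below are hypothetical illustrations, not the thesis's actual interface.

```python
def augcl_schedule(step, switch_step, weak_aug, strong_aug):
    """Sketch of a two-phase augmentation curriculum (hypothetical API).

    Training begins with a weak augmentation so the agent can first learn
    the task, then switches to a strong augmentation to harden the learned
    policy against visual perturbations.
    """
    # Phase 1: weak augmentation; Phase 2: strong augmentation.
    return weak_aug if step < switch_step else strong_aug

# Placeholder augmentations for illustration only (real ones would be
# image transforms such as random crop or color jitter).
weak = lambda img: img
strong = lambda img: img[::-1]
```

The design choice the abstract motivates is that applying the strong augmentation from the start makes the task hard to learn, while applying only the weak one leaves the policy brittle; scheduling resolves that tension.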