This section outlines two classes of methods that steer optimization during training to mitigate distributional shift: *Cross-Distribution Aggregation* and *Navigation via Mode Connectivity*.

## Cross-Distribution Aggregation

One of the main reasons models fail under distributional shift is that they learn spurious correlations, which are distinct from the core objective. By integrating learning information from different domains (or different distributions) into the optimization objective, we expect the model to learn truthful information and invariant relationships.

### ERM: Empirical Risk Minimization

We first introduce ERM as background, and then introduce methods that address distributional shift directly by integrating the loss landscapes of different distributions into the training process.

Consider a scenario where a model has been developed to effectively identify objects by their features. The optimization target can be expressed as:

$$\mathrm{R}(w) = \int \mathrm{L}(y, f(x, w)) \, d\mathrm{P}(x, y)$$

where $\mathrm{L}(y, f(x, w))$ denotes the loss between data labels $y$ and model outputs $f(x, w)$, while $\mathrm{P}(x, y)$ signifies the target data distribution.

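A bias often exists between the dataset and the real world, implying that the features learned from the dataset may not necessarily be the ones we intend for the model to acquire. Because $\mathrm{P}(x, y)$ is unknown in practice, ERM replaces the integral with the average loss over the $n$ training samples, $\frac{1}{n} \sum_{i=1}^{n} \mathrm{L}(y_i, f(x_i, w))$, and minimizes that quantity instead. The result approximates $\mathrm{R}(w)$ only to the extent that the dataset reflects the target distribution.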
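As a small illustration of the idea, the sketch below computes an empirical risk and picks the parameter that minimizes it from a handful of candidates. The one-parameter linear model, the squared loss, and the toy data are assumptions made for the example, not part of the original formulation.

```python
def squared_loss(y, y_hat):
    """L(y, f(x, w)) for a single sample (squared error, chosen for illustration)."""
    return (y - y_hat) ** 2


def linear_model(x, w):
    """Hypothetical one-parameter model f(x, w) = w * x."""
    return w * x


def empirical_risk(w, samples, model, loss=squared_loss):
    """Average loss over the training samples: the empirical stand-in for R(w)."""
    return sum(loss(y, model(x, w)) for x, y in samples) / len(samples)


# Toy training set drawn from y = 2x.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# ERM over a small candidate grid: keep the w with the lowest empirical risk.
candidates = [0.0, 1.0, 2.0, 3.0]
best_w = min(candidates, key=lambda w: empirical_risk(w, train, linear_model))
```

In practice the minimization is carried out by gradient-based optimization rather than a grid, but the objective has the same form.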

### DRO: Distributionally Robust Optimization

Despite the remarkable performance of neural networks, they can sometimes exhibit a pronounced sensitivity to distributional shifts. OOD Generalization can be formulated as follows:

$$ \begin{align*} r_{\mathcal{D}}^{\mathrm{OOD}}(\theta)=\max _{e \in \mathcal{D}} r_e(\theta) \end{align*} $$

This optimization seeks to enhance worst-case performance across a perturbation set, denoted as $\mathcal{D}$, by reducing the maximum value among the risk function set $\{ r_e \mid e \in \mathcal{D} \}$.

In Distributionally Robust Optimization (DRO), the perturbation set covers mixtures of the training distributions of the different domains.
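The worst-case objective above can be sketched in a few lines. The example below is only illustrative: it assumes a one-parameter linear model with squared loss, treats each training environment as one element of the perturbation set, and compares the worst-case choice with plain ERM on the pooled data. The environment names and data are hypothetical.

```python
def env_risk(w, samples):
    """Risk r_e(w) of the toy model f(x, w) = w * x on one environment."""
    return sum((y - w * x) ** 2 for x, y in samples) / len(samples)


def worst_case_risk(w, environments):
    """r^OOD(w): the maximum risk over all environments in the perturbation set."""
    return max(env_risk(w, samples) for samples in environments.values())


def pooled_risk(w, environments):
    """ERM baseline: average loss over all samples, ignoring environment labels."""
    pooled = [pair for samples in environments.values() for pair in samples]
    return env_risk(w, pooled)


# Hypothetical environments: a large one where y = x and a small one where y = 3x.
environments = {
    "majority": [(1.0, 1.0), (2.0, 2.0), (1.0, 1.0), (2.0, 2.0)],
    "minority": [(1.0, 3.0)],
}

candidates = [1.0, 1.25, 1.5, 1.75, 2.0]
w_erm = min(candidates, key=lambda w: pooled_risk(w, environments))
w_dro = min(candidates, key=lambda w: worst_case_risk(w, environments))
```

On this toy data, pooled ERM stays close to the majority environment, while the worst-case objective moves the parameter toward a compromise that lowers the risk on the minority environment.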

**Recommended Papers List**

Causal inference using invariant prediction: identification and confidence intervals

What is the difference of a prediction that is made with a causal model and a non-causal model? Suppose we intervene on the predictor variables or change the whole environment. The predictions from a causal model will in general work as well under interventions as for observational data. In contrast, predictions from a non-causal model can potentially be very wrong if we actively intervene on variables. Here, we propose to exploit this invariance of a prediction under a causal model for causal inference: given different experimental settings (for example various interventions) we collect all models that do show invariance in their predictive accuracy across settings and interventions. The causal model will be a member of this set of models with high probability. This approach yields valid confidence intervals for the causal relationships in quite general scenarios. We examine the example of structural equation models in more detail and provide sufficient assumptions under which the set of causal predictors becomes identifiable. We further investigate robustness properties of our approach under model misspecification and discuss possible extensions. The empirical properties are studied for various data sets, including large-scale gene perturbation experiments.

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Then we propose a new dataset called ImageNet-P which enables researchers to benchmark a classifier’s robustness to common perturbations. Unlike recent robustness research, this benchmark evaluates performance on common corruptions and perturbations not worst-case adversarial perturbations. We find that there are negligible changes in relative corruption robustness from AlexNet classifiers to ResNet classifiers. Afterward we discover ways to enhance corruption and perturbation robustness. We even find that a bypassed adversarial defense provides substantial common perturbation robustness. Together our benchmarks may aid future work toward networks that robustly generalize.

Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization

Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, the poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization—a stronger-than-typical L2 penalty or early stopping—we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is important for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce a stochastic optimization algorithm, with convergence guarantees, to efficiently train group DRO models.

Do imagenet classifiers generalize to imagenet?

We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3%-15% on CIFAR-10 and 11%-14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models’ inability to generalize to slightly "harder" images than those found in the original test sets.

Exploring the landscape of spatial robustness

The study of adversarial robustness has so far largely focused on perturbations bound in $\ell_p$-norms. However, state-of-the-art models turn out to be also vulnerable to other, more natural classes of perturbations such as translations and rotations. In this work, we thoroughly investigate the vulnerability of neural network–based classifiers to rotations and translations. While data augmentation offers relatively small robustness, we use ideas from robust optimization and test-time input aggregation to significantly improve robustness. Finally we find that, in contrast to the $\ell_p$-norm case, first-order methods cannot reliably find worst-case perturbations. This highlights spatial robustness as a fundamentally different setting requiring additional study.

ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on “Stylized-ImageNet”, a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.

Out-of-distribution generalization via risk extrapolation (rex)

Distributional shift is one of the major obstacles when transferring machine learning prediction systems from the lab to the real world. To tackle this problem, we assume that variation across training domains is representative of the variation we might encounter at test time, but also that shifts at test time may be more extreme in magnitude. In particular, we show that reducing differences in risk across training domains can reduce a model’s sensitivity to a wide range of extreme distributional shifts, including the challenging setting where the input contains both causal and anti-causal elements. We motivate this approach, Risk Extrapolation (REx), as a form of robust optimization over a perturbation set of extrapolated domains (MM-REx), and propose a penalty on the variance of training risks (V-REx) as a simpler variant. We prove that variants of REx can recover the causal mechanisms of the targets, while also providing robustness to changes in the input distribution (“covariate shift”). By appropriately trading-off robustness to causally induced distributional shifts and covariate shift, REx is able to outperform alternative methods such as Invariant Risk Minimization in situations where these types of shift co-occur.

Recognition in terra incognita

It is desirable for detection and classification algorithms to generalize to unfamiliar environments, but suitable benchmarks for quantitatively studying this phenomenon are not yet available. We present a dataset designed to measure recognition generalization to novel environments. The images in our dataset are harvested from twenty camera traps deployed to monitor animal populations. Camera traps are fixed at one location, hence the background changes little across images; capture is triggered automatically, hence there is no human bias. The challenge is learning recognition in a handful of locations, and generalizing animal detection and classification to new locations where no training data is available. In our experiments state-of-the-art algorithms show excellent performance when tested at the same location where they were trained. However, we find that generalization to new locations is poor, especially for classification systems.

Robust optimization

This book is devoted to Robust Optimization — a specific and relatively novel methodology for handling optimization problems with uncertain data. The primary goal of this Preface is to provide the reader with a first impression of what the story is about:

- what is the phenomenon of data uncertainty and why it deserves a dedicated treatment,
- how this phenomenon is treated in Robust Optimization, and how this treatment compares to those offered by more traditional methodologies for handling data uncertainty.

The secondary, quite standard, goal is to outline the main topics of the book and describe its contents.

Statistics of robust optimization: A generalized empirical likelihood approach

We study statistical inference and distributionally robust solution methods for stochastic optimization problems, focusing on confidence intervals for optimal values and solutions that achieve exact coverage asymptotically. We develop a generalized empirical likelihood framework—based on distributional uncertainty sets constructed from nonparametric f-divergence balls—for Hadamard differentiable functionals, and in particular, stochastic optimization problems. As consequences of this theory, we provide a principled method for choosing the size of distributional uncertainty regions to provide one- and two-sided confidence intervals that achieve exact coverage. We also give an asymptotic expansion for our distributionally robust formulation, showing how robustification regularizes problems by their variance. Finally, we show that optimizers of the distributionally robust formulations we study enjoy (essentially) the same consistency properties as those in classical sample average approximations. Our general approach applies to quickly mixing stationary sequences, including geometrically ergodic Harris recurrent Markov chains.

### IRM: Invariant Risk Minimization

First, let us formally define the previously mentioned *distribution shift*. Let $X$ denote the model's inputs and $Y$ the corresponding output labels. If $P(Y \mid X)$ is an invariant relationship and the shift is restricted to changes in $P(X)$, we obtain what is known as covariate shift. Unlike the conventional ERM approach, the IRM method exploits this invariant relationship to address covariate shift.
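This section does not state the IRM objective itself. As a rough sketch, the example below assumes the IRMv1 penalty from the IRM paper: for each environment, the squared gradient of the risk with respect to a fixed scalar classifier evaluated at 1.0, added to the risk with a weight `lam`. The linear featurizer, the squared loss, and the two toy environments are illustrative assumptions.

```python
def featurize(weights, x):
    """Hypothetical linear representation phi(x) = sum_j weights[j] * x[j]."""
    return sum(w * xi for w, xi in zip(weights, x))


def risk(weights, samples):
    """Squared-error risk of the scalar classifier s = 1.0 on top of phi."""
    return sum((featurize(weights, x) - y) ** 2 for x, y in samples) / len(samples)


def irm_penalty(weights, samples):
    """Squared derivative of mean((s * phi(x) - y)^2) with respect to s at s = 1.0."""
    grad = sum(
        2.0 * (featurize(weights, x) - y) * featurize(weights, x) for x, y in samples
    ) / len(samples)
    return grad ** 2


def irm_objective(weights, environments, lam):
    """Sum over environments of risk plus lam times the invariance penalty."""
    return sum(risk(weights, s) + lam * irm_penalty(weights, s) for s in environments)


# Each input is (x1, x2): y = x1 in both environments (invariant feature),
# while x2 equals y in the first environment and -y in the second (spurious).
env_1 = [((1.0, 1.0), 1.0), ((2.0, 2.0), 2.0)]
env_2 = [((1.0, -1.0), 1.0), ((2.0, -2.0), 2.0)]
environments = [env_1, env_2]

invariant = (1.0, 0.0)  # representation that reads only x1
spurious = (0.0, 1.0)   # representation that reads only x2

invariant_score = irm_objective(invariant, environments, lam=1.0)
spurious_score = irm_objective(spurious, environments, lam=1.0)
```

The representation built on the invariant feature has zero penalty in both environments, whereas the one built on the spurious feature is penalized in the environment where the correlation flips.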

**Recommended Papers List**

Invariant risk minimization

We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.

Principles of risk minimization for learning theory

Learning is posed as a problem of function estimation, for which two principles of solution are considered: empirical risk minimization and structural risk minimization. These two principles are applied to two different statements of the function estimation problem: global and local. Systematic improvements in prediction power are illustrated in application to zip-code recognition.

### REx: Risk Extrapolation

REx comes in two variants. The basic form performs robust optimization over a perturbation set of extrapolated domains (MM-REx); a simpler variant instead imposes a penalty on the variance of the training risks (V-REx). By reducing the training risks while making them more similar across domains, REx pushes the model to learn the relationship that is invariant across the different domain distributions.
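A minimal sketch of the V-REx idea follows: the objective adds a variance penalty, weighted by a coefficient `beta`, to the sum of the per-domain training risks. The use of the population variance and the example risk values are assumptions made for illustration.

```python
def v_rex_objective(risks, beta):
    """Sum of per-domain risks plus beta times their (population) variance."""
    mean = sum(risks) / len(risks)
    variance = sum((r - mean) ** 2 for r in risks) / len(risks)
    return sum(risks) + beta * variance


# Two hypothetical models with the same total risk across two training domains.
uneven_risks = [1.0, 3.0]  # good on one domain, poor on the other
equal_risks = [2.0, 2.0]   # equally good on both domains

uneven_score = v_rex_objective(uneven_risks, beta=10.0)
equal_score = v_rex_objective(equal_risks, beta=10.0)
```

With `beta = 0` the two models score the same; as `beta` grows, the model whose risks are equal across domains is increasingly preferred.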

**Recommended Papers List**

Elements of causal inference: foundations and learning algorithms

A concise and self-contained introduction to causal inference, increasingly important in data science and machine learning.

Principles of risk minimization for learning theory

Learning is posed as a problem of function estimation, for which two principles of solution are considered: empirical risk minimization and structural risk minimization. These two principles are applied to two different statements of the function estimation problem: global and local. Systematic improvements in prediction power are illustrated in application to zip-code recognition.

Invariant risk minimization

We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.

## Navigation via Mode Connectivity

Here, we first introduce *mode connectivity* as a prerequisite. We then discuss the *Connectivity-Based Fine-tuning (CBFT)* method, illustrating how mode connectivity can steer a model to predict based on invariant relationships instead of spurious correlations while changing only a few parameters.

### Mode Connectivity

Mode connectivity refers to the existence of a simple path in the loss landscape that connects two or more distinct local minima (modes). It can be defined formally as follows:

The model’s loss on a dataset $\mathcal{D}$ is represented as $\mathcal{L}(f(\mathcal{D}; \theta))$, where $\theta$ denotes the parameters of the model and $f(\mathcal{D}; \theta)$ denotes the model with parameters $\theta$ evaluated on dataset $\mathcal{D}$. We define $\theta$ as a minimizer of the loss on this dataset if $\mathcal{L}(f(\mathcal{D}; \theta))<\epsilon$, where $\epsilon$ is a small scalar value.

Minimizers $\theta_1$ and $\theta_2$, obtained by training on dataset $\mathcal{D}$, are considered to be mode-connected if there exists a continuous path $\gamma$ from $\theta_1$ to $\theta_2$ such that, as $\theta_0$ varies along $\gamma$ (parameterized by $t \in [0,1]$), the following condition holds:

$$ \begin{align*} \forall t \in [0,1], \quad \mathcal{L}\left(f\left(\mathcal{D} ; \theta_0\right)\right) \leq t \cdot \mathcal{L}\left(f\left(\mathcal{D} ; \theta_1\right)\right) + \left(1-t\right) \cdot \mathcal{L}\left(f\left(\mathcal{D} ; \theta_2\right)\right). \end{align*} $$

**Recommended Papers List**

Deep networks on toroids: removing symmetries reveals the structure of flat regions in the landscape geometry

We systematize the approach to the investigation of deep neural network landscapes by basing it on the geometry of the space of implemented functions rather than the space of parameters. Grouping classifiers into equivalence classes, we develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology. On this space, we explore the error landscape rather than the loss. This lets us derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them. Using different optimization algorithms that sample minimizers with different flatness we study the mode connectivity and relative distances. Testing a variety of state-of-the-art architectures and benchmark datasets, we confirm the correlation between flatness and generalization performance; we further show that in function space flatter minima are closer to each other and that the barriers along the geodesics connecting them are small. We also find that minimizers found by variants of gradient descent can be connected by zero-error paths composed of two straight lines in parameter space, ie polygonal chains with a single bend. We observe similar qualitative results in neural networks with binary weights and activations, providing one of the first results concerning the connectivity in this setting. Our results hinge on symmetry removal, and are in remarkable agreement with the rich phenomenology described by some recent analytical studies performed on simple shallow models.

Essentially no barriers in neural network energy landscape

Training neural networks involves finding minima of a high-dimensional non-convex loss function. Relaxing from linear interpolations, we construct continuous paths between minima of recent neural network architectures on CIFAR10 and CIFAR100. Surprisingly, the paths are essentially flat in both the training and test landscapes. This implies that minima are perhaps best seen as points on a single connected manifold of low loss, rather than as the bottoms of distinct valleys.

Linear Connectivity Reveals Generalization Strategies

It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster – models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.

Linear mode connectivity and the lottery ticket hypothesis

We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (eg, random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy. We find that these subnetworks only reach full accuracy when they are stable to SGD noise, which either occurs at initialization for small-scale settings (MNIST) or early in training for large-scale settings (ResNet-50 and Inception-v3 on ImageNet).

Loss surface simplexes for mode connecting volumes and fast ensembling

With a better understanding of the loss surfaces for multilayer networks, we can build more robust and accurate training procedures. Recently it was discovered that independently trained SGD solutions can be connected along one-dimensional paths of near-constant training loss. In this paper, we in fact demonstrate the existence of mode-connecting simplicial complexes that form multi-dimensional manifolds of low loss, connecting many independently trained models. Building on this discovery, we show how to efficiently construct simplicial complexes for fast ensembling, outperforming independently trained deep ensembles in accuracy, calibration, and robustness to dataset shift. Notably, our approach is easy to apply and only requires a few training epochs to discover a low-loss simplex.

Loss surfaces, mode connectivity, and fast ensembling of dnns

The loss functions of deep neural networks are complex and their geometric properties are not well understood. We show that the optima of these complex loss functions are in fact connected by simple curves, over which training and test accuracy are nearly constant. We introduce a training procedure to discover these high-accuracy pathways between modes. Inspired by this new geometric insight, we also propose a new ensembling method entitled Fast Geometric Ensembling (FGE). Using FGE we can train high-performing ensembles in the time required to train a single model. We achieve improved performance compared to the recent state-of-the-art Snapshot Ensembles, on CIFAR-10, CIFAR-100, and ImageNet.

Mechanistic mode connectivity

We study neural network loss landscapes through the lens of mode connectivity, the observation that minimizers of neural networks retrieved via training on a dataset are connected via simple paths of low loss. Specifically, we ask the following question: are minimizers that rely on different mechanisms for making their predictions connected via simple paths of low loss? We provide a definition of mechanistic similarity as shared invariances to input transformations and demonstrate that lack of linear connectivity between two models implies they use dissimilar mechanisms for making their predictions. Relevant to practice, this result helps us demonstrate that naive fine-tuning on a downstream dataset can fail to alter a model’s mechanisms, e.g., fine-tuning can fail to eliminate a model’s reliance on spurious attributes. Our analysis also motivates a method for targeted alteration of a model’s mechanisms, named connectivity-based fine-tuning (CBFT), which we analyze using several synthetic datasets for the task of reducing a model’s reliance on spurious attributes.
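The mode-connectivity condition defined earlier in this subsection can be checked numerically on a toy problem. The sketch below is a simplification: it only tests the straight-line path between two minimizers (the linear special case), whereas the definition allows any continuous path. The path parameterization $\gamma(t) = t\,\theta_1 + (1-t)\,\theta_2$, the tolerance, and the two toy loss functions are assumptions made for the example.

```python
def interpolate(theta_1, theta_2, t):
    """Point on the straight line gamma(t) = t * theta_1 + (1 - t) * theta_2."""
    return [t * a + (1.0 - t) * b for a, b in zip(theta_1, theta_2)]


def linearly_connected(loss, theta_1, theta_2, steps=101, tol=1e-9):
    """Check L(gamma(t)) <= t * L(theta_1) + (1 - t) * L(theta_2) on a grid of t."""
    loss_1, loss_2 = loss(theta_1), loss(theta_2)
    for i in range(steps):
        t = i / (steps - 1)
        bound = t * loss_1 + (1.0 - t) * loss_2
        if loss(interpolate(theta_1, theta_2, t)) > bound + tol:
            return False  # a loss barrier rises above the interpolated bound
    return True


def valley_loss(theta):
    """Toy loss that is flat along the first coordinate: every (a, 0) is a minimizer."""
    return theta[1] ** 2


def double_well_loss(theta):
    """Toy loss with minima at -1 and +1 separated by a barrier at 0."""
    return (theta[0] ** 2 - 1.0) ** 2


valley_connected = linearly_connected(valley_loss, [-1.0, 0.0], [1.0, 0.0])
wells_connected = linearly_connected(double_well_loss, [-1.0], [1.0])
```

The two valley minimizers pass the check, while the two wells do not. A failed linear check does not by itself rule out a curved low-loss path between the same points.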

### CBFT: Connectivity-Based Fine-tuning

Models tend to develop similar inference mechanisms when trained on similar data. This could be a significant reason for the emergence of bias in models, such as relying on the background information of images for classification rather than the objects depicted in the photos. To overcome this problem, CBFT proposes a valid strategy for altering a model’s mechanism, which aims to minimize the following loss:

$$ \begin{align*} \mathcal{L}_{\mathrm{CBFT}} &= \mathcal{L}_{\mathrm{CE}}\left(f\left(\mathcal{D}_{\mathrm{NC}} ; \theta\right), y\right)+\mathcal{L}_{\mathrm{B}}+\frac{1}{K} \mathcal{L}_{\mathrm{I}} \end{align*} $$

where the original training dataset is denoted as $\mathcal{D}$, and we assume that we can obtain a minimal dataset without the spurious attribute $C$, denoted as $\mathcal{D}_{\mathrm{NC}}$.

**Recommended Papers List**

Mechanistic mode connectivity

We study neural network loss landscapes through the lens of mode connectivity, the observation that minimizers of neural networks retrieved via training on a dataset are connected via simple paths of low loss. Specifically, we ask the following question: are minimizers that rely on different mechanisms for making their predictions connected via simple paths of low loss? We provide a definition of mechanistic similarity as shared invariances to input transformations and demonstrate that lack of linear connectivity between two models implies they use dissimilar mechanisms for making their predictions. Relevant to practice, this result helps us demonstrate that naive fine-tuning on a downstream dataset can fail to alter a model’s mechanisms, e.g., fine-tuning can fail to eliminate a model’s reliance on spurious attributes. Our analysis also motivates a method for targeted alteration of a model’s mechanisms, named connectivity-based fine-tuning (CBFT), which we analyze using several synthetic datasets for the task of reducing a model’s reliance on spurious attributes.