Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers
Nathan Tsoi, Kate Candon, Deyuan Li, Yofti Milkessa, Marynel Vázquez
Advances in Neural Information Processing Systems (NeurIPS) 2022
While neural network binary classifiers are often evaluated on metrics such as Accuracy and $F_1$-Score, they are commonly trained with a cross-entropy objective. How can this training-evaluation gap be addressed? While specific techniques have been adopted to optimize certain confusion matrix based metrics, it is challenging or impossible in some cases to generalize the techniques to other metrics. Adversarial learning approaches have also been proposed to optimize networks via confusion matrix based metrics, but they tend to be much slower than common training methods. In this work, we propose a unifying approach to training neural network binary classifiers that combines a differentiable approximation of the Heaviside function with a probabilistic view of the typical confusion matrix values using soft sets. Our theoretical analysis shows the benefit of using our method to optimize for a given evaluation metric, such as $F_1$-Score, with soft sets. Also, our extensive experiments show the effectiveness of our approach in several domains.
@inproceedings{tsoi2022bridging,
title={Bridging the Gap: Unifying the Training and Evaluation of Neural Network Binary Classifiers},
author={Tsoi, Nathan and Candon, Kate and Li, Deyuan and Milkessa, Yofti and V{\'a}zquez, Marynel},
booktitle={Advances in Neural Information Processing Systems},
year={2022}
}

## Try out the Bridging the Gap (BtG) Code in Your PyTorch Project

Install the torch-btg package from PyPI:
pip install torch-btg
Use the propsoed method to optimize F1-Score as a loss:
from torch_btg.loss import f1_loss
...
criterion = fb_loss(beta=1.0)
Or optimize Accuracy as a loss:
from torch_btg.loss import accuracy_loss
...
criterion = accuracy_loss()
The full code is available in the torch-btg git repository and detailed usage is shown in the paper's git repository.

## Binary Classification

Binary classification is a supervised learning problem which relies on a labeled dataset of $n$ examples composed of input features $\{x_1, ..., x_n\}$ and corresponding binary output labels $\{y_1, ..., y_n\}$. The goal is to output the correct label $y_i \in \{0,1\}$, given the input features $x_i$.
For example, the CocktailParty Dataset is one of the datasets we use in our experiments. It contains features including the location and orientation of 6 people in a room and corresponding labels indicating if they are in the same conversational group. The goal is to train a classifier to predict if two individuals are part of the same conversational group.
One approach to binary classification is to use an artificial neural network. The parameters of the network are learned by applying backpropagation and stochastic gradient descent to batches of training data. In order to determine the direction of steepest descent, a loss function must be specified.
The loss function used for training a neural network indicates how far the network's output is from the ground truth labels. In supervised learning, the parameters for a neural network are often learned using binary cross-entropy (BCE) loss.

## Measuring Classifier Performance

While a loss such as binary cross-entropy is often used to train a neural network classifier, the same classifier is evaluated using a different metric.
Label
1 0
Prediction 1 TP FP
0 FN TN
Evaluation metrics are often composed of values from the confusion matrix which consists of True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) values. These confusion matrix values are computed based on the classifier predictions and the ground truth labels. Examples of such metrics include Accuracy, Precision, Recall, and $F_1$-Score.
Can we train neural network binary classifiers directly on evaluation metrics computed from the confusion matrix? The answer is no. This is because evaluation metrics based on the confusion matrix are not differentiable.
The gap between training loss and evaluation metric exists because neural network binary classifiers are often trained using binary cross-entropy (BCE) loss, but evaluated on confusion matrix based metrics such as Accuracy, Precision, Recall, and $F_1$-Score.
$\text{BCE Loss} = - \frac{1}{n} \sum_{i=1}^n (y_i \log p_i + (1 - y_i) \log (1 - p_i))$
$\text{Accuracy} = \frac{|TP|+|TN|}{|TP|+|TN|+|FP|+|FN|}$
$\text{Precision} =\frac{|TP|}{|TP|+|FP|}$, $\text{Recall} =\frac{|TP|}{|TP|+|FN|}$
$F_1\text{-Score} = \frac{2}{\text{Precision}^{-1} + \text{Recall}^{-1}}$
These expressions are very different and we have no reason to think they would converge to the same value.

## Training for the Desired Evaluation Metric

A logical way to bridge the gap between BCE loss and an evaluation metric, such as $F_1$-Score for example, would be to train the classification network using the desired evalution metric. However, this is not possible because evaluation metrics based on the confusion matrix are not differentiable. Classifier outputs are thresholdeded using the Heaviside step function, which is not differentiable, before summing the thresholded values into the confusion matrix. Therefore, we cannot use gradient descent to minimize evaluation metrics computed over confusion matrix values, such as $F_1$-Score.

## Computing Metrics based on the Confusion Matrix

The heaviside step function is non-differentiable because it is undefined at the chosen threshold value and zero elsewhere. For example, the following figure shows the forward pass of a network. The output of the network is passed through the Heaviside step function, $H$, and confusion matrix values are then computed. From the confusion matrix, typical metrics like $F_1$-Score and Accuracy are computed.
In order to use a loss function which is more closely aligned with the desired evaluation metric, we propose to optimize model parameters based on a metric that is computed over a soft-set confusion matrix, which is differentiable.

## Computing Losses based on the Soft-Set Confusion Matrix

To compute soft-set confusion matrix values, we experiment with two differentiable approximations of the Heaviside function: 1) a piecewise linear approximation of the Heaviside that we propose, which adheres to certain properties that are important for backpropagation, and 2) a parameterized sigmoid. The figures below show the typical Heaviside function, the linear approximation, and the sigmoid approximation all at a threshold $\tau=0.7$.
Heaviside Function
Linear Approximation
Sigmoid Approximation
The following figure shows a differentiable calculation of $F_1$-Score and Accuracy. The output of the network is passed through the proposed linear Heaviside approximation which results in the probability that the example belongs to a given soft-set confusion matrix value. Then, typical confusion-matrix based metrics like $F_1$-Score and Accuracy can then be computed as normal, but from the soft-set confusion matrix. In this configuration, the confusion-matrix based metrics are differentiable and can be used as a loss to optimize the network.

## Theoretical Grounding

We describe briefly here the proofs in Section 4 of the paper. In particular, we show that a variety of soft-set based metrics under the proposed Heaviside function approximation are Lipschitz continuous. Lipschitz continuity ensures that the difference between successive losses is bounded across iterations of stochastic gradient descent when such functions are used as the loss during neural network training. We also show that metrics computed over the soft-set confusion matrix values are asymptotically similar to the true metric under certain assumptions.

### Lipschitz Continuity

Lipschitz continuity of the loss function in a neural network optimized via stochastic gradient descent indicates convergence without extreme variations in losses throughout training. When stochastic gradient descent is used to update a neural network's weights ($w$) using the objective function $\ell(w)$ at learning rate $\alpha_i$, then at the $i^\text{th}$ step the weight update is given by: $w_{i+1} \to w_i - \alpha_i|\ell'(w_i)|$. A small local change in the weights $|w_{i+1} - w_i| = \alpha_i|\ell'(w_i)|$ corresponds to a small local change in the value of the objective function of $|\ell(w_{i+1}) - \ell(w_i)| \le K\alpha_i|\ell'(w_i)|$, where $K$ is the Lipschitz constant.
Since the linear Heaviside approximation $\mathcal{H}^l$ is Lipschitz continuous, every entry of the soft-set confusion matrix based on the Heaviside approximations is Lipschitz continuous in the output of a neural network. Metrics which are then computed from the soft-set confusion matrix, such as $F_1$-Score and Accuracy, are also Lipschitz continuous. Examples of other metrics which can be computed from the soft-set confusion matrix include Balanced Accuracy, $F_\beta$ , Jaccard, and G-Mean. These metrics are also Lipschitz continuous under our proposed method.

### Convergence in Approximation

We also find that in the limit, as the number of examples goes to infinity, $F_1$-Score computed over soft-sets approximates the expected true $F_1$-score under a set of assumptions. Similar proofs for Accuracy, AUROC, and other metrics computed using our method via the linear Heaviside approximation and soft-sets are also provided.
Consider a dataset of size $n$ with $\{x_1, ..., x_n\}$ examples and $\{y_1, ..., y_n\}$ labels, $rn$ of which are positive. We also suppose the classifier correctly classifies any positive example as a true positive with some probability $u$ and any negative example as a false positive with probability $v$, and that all classifications are independent.
Because $F_1$-Score is calculated with discrete values, we assume that the classifier will classify examples as a random variable as Bernoulli trials, depending on whether the example was positive or negative. On the other hand, since $F_1$-Score computed over soft-sets ($F_1^s$) can take on continuous values in $[0, 1]$, we model the classification result for each individual sample for $F_1^s$ as a random variable drawn from a Beta distribution, which has support $[0, 1]$.
Under our assumptions, both $F_1$ and $F_1^s$ have the same average classification correctness. We then show in the paper that $F_1$ and $F_1^s$ both converge almost surely to the same value as $n \to \infty$ and in particular, $\mathbb{E}\left[F_1\right], \mathbb{E}\left[F_1^s\right] \to \frac{2ru}{r + ru + v - rv}$, both converge to the same constant.
This means that the $F_1^s$ value is an asymptotically unbiased estimator for the expected true $F_1$-Score and that we can expect average $F_1$-Score values to converge to $F_1^s$ as $n \to \infty$, under our setup.