On Calibration of Modern Neural Networks. Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger, ICML 2017
The paper addresses Confidence Calibration: the problem of producing probability estimates that are representative of the true correctness likelihood. The authors find that, unlike older neural networks (e.g. LeNet), modern ones (e.g. ResNet) are poorly calibrated and have instead become quite overconfident.
Confidence histograms (top) and reliability diagrams (bottom)
for a 5-layer LeNet (left) and a 110-layer ResNet (right) on CIFAR-100
The authors experimented with various factors that affect confidence calibration in modern architectures, such as the depth and width of the network, weight decay, and Batch Normalization. They then analysed the performance of several post-hoc calibration methods across numerous architectures and datasets: Histogram Binning, Isotonic Regression, Bayesian Binning into Quantiles (BBQ), and Matrix and Vector Scaling, along with a calibration method they propose, Temperature Scaling. They observed that Temperature Scaling outperforms the other calibration methods on most vision and NLP tasks.
During training, after the model is able to correctly classify (almost) all training samples, NLL can be further minimized by increasing the confidence of predictions. Increased model capacity will lower training NLL, and thus the model will be more (over)confident on average.
There’s a disconnect between NLL and accuracy. This occurs because neural networks can overfit to NLL without overfitting to the 0/1 loss.
This suggests that these high-capacity models are not necessarily immune from overfitting, but rather, overfitting manifests in probabilistic error rather than classification error.
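The mechanism described above can be made concrete with a toy example (the logit values below are illustrative, not from the paper): uniformly scaling the logits up leaves every prediction unchanged, yet drives the NLL of already-correct predictions toward zero. A model that keeps growing its logits after accuracy saturates therefore keeps "improving" NLL purely by becoming more overconfident.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Two toy logit vectors, both already classified correctly (true class = 0).
logits = np.array([[2.0, 0.5, 0.1],
                   [3.0, 1.0, 0.2]])
labels = np.array([0, 0])

for scale in [1.0, 5.0]:
    probs = softmax(logits * scale)
    nll = -np.log(probs[np.arange(len(labels)), labels]).mean()
    # Predictions (argmax) are identical at both scales; only NLL and confidence change.
    print(scale, probs.argmax(axis=1), round(nll, 4))
```

At both scales the argmax is the same, so 0/1 loss is constant, but the NLL at scale 5 is far lower: exactly the disconnect between NLL and accuracy described above.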
The effect of network depth (far left), width (middle left), Batch Normalization (middle right),
and weight decay (far right) on miscalibration, as measured by ECE (lower is better).
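ECE (Expected Calibration Error) partitions predictions into equally spaced confidence bins and takes a weighted average of the gap between each bin's average confidence and its accuracy. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: weighted average of |avg confidence - accuracy| per confidence bin.

    confidences: array of max softmax probabilities in (0, 1].
    correct: array of 0/1 indicators (prediction == label).
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()
            conf = confidences[mask].mean()
            ece += (mask.sum() / n) * abs(conf - acc)
    return ece
```

For example, a model that always predicts with confidence 0.9 but is right only half the time has ECE = |0.9 − 0.5| = 0.4.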
Given the logit vector $\textbf{z}_i$, the new confidence prediction $\hat{q}_i$ is: $\hat{q}_i = \max_{k} {\sigma}_{SM}(\textbf{z}_i/T)^{(k)}$, where $T > 0$ is a single scalar temperature parameter optimized with respect to NLL on the validation set.
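A sketch of fitting the temperature is shown below. The paper only specifies that $T$ is chosen to minimize validation NLL; the use of `scipy.optimize.minimize_scalar`, the bounds, and the function name `fit_temperature` are implementation choices for illustration, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(val_logits, val_labels):
    """Find the scalar T > 0 that minimizes NLL of softmax(logits / T)
    on held-out validation logits. Note T rescales logits, so the
    argmax (and hence accuracy) is unchanged; only confidence changes."""
    def nll(T):
        probs = softmax(val_logits / T)
        return -np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12).mean()
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x
```

Because a single $T$ is shared across all classes, temperature scaling softens (T > 1) or sharpens (T < 1) the softmax without ever changing the predicted class, which is why it calibrates without affecting accuracy.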
Reliability diagrams for CIFAR-100 before (far left) and after calibration (middle left, middle right, far right).
ECE (%) (with M = 15 bins) on standard vision and NLP datasets before calibration and with various calibration methods.
Further discussion of Temperature Scaling, such as its limitations, could have been provided, and additional experiments could have been performed to further validate the method's performance.
Links to the project page, video and implementation for the paper.