
On Calibration of Modern Neural Networks

Chuan Guo, Geoff Pleiss, Yu Sun, Kilian Q. Weinberger, ICML 2017

Summary

The paper studies confidence calibration: the problem of predicting probability estimates that are representative of the true correctness likelihood. The authors find that, unlike older neural networks (e.g. LeNet), modern ones (e.g. ResNet) are poorly calibrated and tend to be quite overconfident.


Figure: Confidence histograms (top) and reliability diagrams (bottom) for a 5-layer LeNet (left) and a 110-layer ResNet (right) on CIFAR-100.

The authors examine how factors such as network depth and width, weight decay, and Batch Normalization affect confidence calibration in modern architectures. They then analyse the performance of several post-hoc calibration methods, namely Histogram Binning, Isotonic Regression, Bayesian Binning into Quantiles (BBQ), and Matrix and Vector Scaling, across numerous architectures and datasets, alongside a calibration method they propose: Temperature Scaling. They observe that Temperature Scaling outperforms the other methods on most vision and NLP tasks.
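As a rough illustration of how the scaling-based methods differ, the sketch below contrasts the three parameterizations applied to a network's logits; the shapes and parameter values are made up for illustration only.

```python
import torch

N, K = 8, 10                      # hypothetical batch size and class count
z = torch.randn(N, K)             # uncalibrated logits from a trained network

# Temperature scaling: a single scalar T shared by all classes.
T = torch.tensor(1.5)
q_temp = torch.softmax(z / T, dim=1)

# Vector scaling: a per-class scale and bias on the logits.
w, b = torch.ones(K), torch.zeros(K)
q_vec = torch.softmax(z * w + b, dim=1)

# Matrix scaling: a full linear map over the logit vector.
W, c = torch.eye(K), torch.zeros(K)
q_mat = torch.softmax(z @ W.T + c, dim=1)
```

In the paper, the parameters of each method are fit on a held-out validation set. Temperature scaling has only one parameter and, since dividing by a positive scalar cannot change the arg-max, it leaves accuracy untouched.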

Contributions


Figure: The effect of network depth (far left), width (middle left), Batch Normalization (middle right), and weight decay (far right) on miscalibration, as measured by ECE (lower is better).

Temperature scaling "softens" the softmax with a single scalar temperature T learned on a held-out validation set; the calibrated confidence is the maximum entry of the tempered softmax:

$$\hat{q}_i = \max_{k}\, \sigma_{\mathrm{SM}}(\mathbf{z}_i / T)^{(k)}$$
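A minimal sketch of fitting T, assuming validation-set logits and labels have already been collected; as in the paper, T is optimized with respect to NLL on the validation set. The helper name and the choice of LBFGS are ours.

```python
import torch

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Learn a scalar temperature T by minimizing NLL on held-out data.

    val_logits: (N, K) tensor of logits; val_labels: (N,) tensor of class ids.
    """
    T = torch.ones(1, requires_grad=True)
    nll = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([T], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = nll(val_logits / T, val_labels)  # tempered logits
        loss.backward()
        return loss

    optimizer.step(closure)
    return T.item()
```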


Figure: Reliability diagrams for CIFAR-100 before (far left) and after calibration (middle left, middle right, far right).

Results and Comparisons


Table: ECE (%) (with M = 15 bins) on standard vision and NLP datasets, before calibration and with various calibration methods.
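For reference, a minimal sketch of the ECE metric used in the table, assuming M equal-width confidence bins (M = 15 as above); the function name is ours.

```python
import numpy as np

def expected_calibration_error(confidences, is_correct, n_bins=15):
    """Weighted average gap between confidence and accuracy over M bins.

    confidences: (N,) max softmax probabilities; is_correct: (N,) 0/1 array.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(is_correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # bin weight = |B_m| / N
    return ece
```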

Two-Cents

Further discussion of why Temperature Scaling works as well as it does could have been provided, and more experiments could have been performed to validate the method's performance.

Resources

Links to the project page, video and implementation for the paper.