Error Bars Before Confidence

2026-04-24

I get nervous when a model sounds certain before it has earned the right.

Scientific surrogates make this temptation easy. A slow simulator produces expensive data. A fast model learns the map. Suddenly the room wants recommendations: which parameter set, which design, which next run, which minimum. Because the surrogate speaks quickly and draws smooth surfaces, it starts to feel like a cheaper oracle. Right there, I want the error bars in the room first.

A low error number does one job. It says the prediction was close on a particular test. It does not tell me whether the model knows where it is, whether it has wandered outside its map, or whether the decision I am about to make sits in the one ugly corner the average hides. A parity plot can look respectable while the recommended point is being held up by extrapolation and optimism.

Mean predictions are cheap confidence. Calibrated uncertainty is more expensive and more useful.

In calibration work, the hard part is rarely noticing error. Of course there is error. The hard part is deciding what the error lets me do. If uncertainty is small near a measured basin, propose a refinement. If it grows outside the explored envelope, abstain or ask for coverage first. If two candidates have similar means but different uncertainty, “pick the lower mean” is too thin an answer. The decision is a trade between exploitation, information, and the cost of being wrong.

At that point, the error bar stops being decoration around the plot. It becomes part of the controller.
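A minimal sketch of what "part of the controller" could look like. Everything named here is hypothetical: the `Candidate` fields, the thresholds in `action`, and the `kappa` weight are illustrative placeholders, and the objective is assumed to be minimized. The point is only that the same two candidates rank differently once uncertainty enters the score.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    mean: float            # predicted objective; lower is better (assumption)
    std: float             # predictive standard deviation from the surrogate
    inside_envelope: bool  # whether the point lies in the explored region


def action(c: Candidate, refine_std: float = 0.05, abstain_std: float = 0.5) -> str:
    """Map uncertainty to an action. Thresholds are placeholders, not advice."""
    if not c.inside_envelope or c.std >= abstain_std:
        return "abstain"   # ask for coverage before recommending anything
    if c.std <= refine_std:
        return "refine"    # small uncertainty near known data: propose a local run
    return "explore"       # in between: a run here mostly buys information


def score(c: Candidate, kappa: float) -> float:
    """Risk-adjusted score, lower is better.

    kappa > 0 penalizes uncertainty (cost of being wrong dominates),
    kappa < 0 rewards it (information dominates),
    kappa = 0 ranks by the mean alone.
    """
    return c.mean + kappa * c.std


def pick(candidates, kappa: float) -> str:
    """Return the name of the candidate with the lowest risk-adjusted score."""
    return min(candidates, key=lambda c: score(c, kappa)).name


# Two candidates with similar means and very different uncertainty.
a = Candidate("a", mean=1.00, std=0.05, inside_envelope=True)
b = Candidate("b", mean=0.98, std=0.40, inside_envelope=True)

for kappa in (-1.0, 0.0, 1.0):
    print(f"kappa={kappa:+.1f} -> pick {pick([a, b], kappa)}")
```

Ranking by the mean alone picks `b`. A risk-averse weight flips the choice to `a`. Which weight is right depends on what a wrong run costs, and the model cannot settle that on its own.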

The reports I trust put the boring context next to the prediction: interval, nearest observed point, local data density, leave-one-out residual behavior, and whether the candidate is interpolation or a polite form of extrapolation. “Lowest predicted error” is not enough. Lowest according to what model class, under what coverage, inside which envelope, with what fallback if the next run disagrees?
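Some of that boring context is cheap to compute. The sketch below is one possible version, not a standard: `candidate_context` is a hypothetical helper, the `radius` is arbitrary, and the bounding-box test is only a crude stand-in for "inside the envelope". A point can sit inside the box and still be far from every observation, which is why the nearest distance and neighbor count are reported alongside it.

```python
import math


def candidate_context(candidate, observed, radius=0.25):
    """Describe where a candidate sits relative to the observed design points.

    candidate: tuple of floats; observed: list of tuples of the same length.
    Inputs are assumed to be scaled so Euclidean distance is meaningful.
    """
    dists = [math.dist(candidate, x) for x in observed]
    # Per-dimension check against the min/max of the observed points.
    inside_box = all(
        min(x[j] for x in observed) <= candidate[j] <= max(x[j] for x in observed)
        for j in range(len(candidate))
    )
    return {
        "nearest_distance": min(dists),
        "neighbors_within_radius": sum(d <= radius for d in dists),
        "inside_bounding_box": inside_box,
    }


# Four observed corners of a unit square (made-up data for illustration).
observed = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(candidate_context((0.5, 0.5), observed))  # inside the box, but no close neighbors
print(candidate_context((2.0, 0.5), observed))  # outside the box: extrapolation
```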

Scientific ML needs this because the data are usually small, structured, and expensive. You do not get to average your way out of bad judgment with a million examples. The model has to admit when the design space is under-sampled. It has to distinguish a valley it has seen from a valley it has imagined. If uncertainty is not printed next to performance, the user will remember the number and forget the doubt.

Calibration gives the interval a job. A ninety-five percent interval should contain the truth about ninety-five percent of the time, across the region where the model claims it applies. If it covers eighty percent, the interval is theater. If it covers ninety-nine percent by becoming uselessly wide, the model is hiding. The useful state is narrower and harder: intervals sharp enough to guide action, honest enough to survive contact with new runs.
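Both halves of that check fit in a few lines. This is a sketch under one loud assumption: the surrogate's predictive distribution is treated as Gaussian, so the interval is mean plus or minus a normal quantile times the standard deviation. `interval_report` is a hypothetical helper and the numbers are made up.

```python
from statistics import NormalDist


def interval_report(y_true, mean, std, level=0.95):
    """Compare nominal and empirical coverage, and report mean interval width.

    Assumes Gaussian predictive distributions (an assumption, not a given).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    covered = [abs(y - m) <= z * s for y, m, s in zip(y_true, mean, std)]
    widths = [2 * z * s for s in std]
    return {
        "nominal": level,
        "empirical": sum(covered) / len(covered),  # honesty
        "mean_width": sum(widths) / len(widths),   # sharpness
    }


# Held-out runs versus surrogate predictions (illustrative values only).
y_true = [0.0, 1.0, 2.0, 10.0]
mean = [0.1, 1.1, 1.9, 2.0]
std = [0.1, 0.1, 0.1, 0.1]

print(interval_report(y_true, mean, std))
```

Coverage alone can be gamed by widening every interval, and width alone can be gamed by shrinking them, so the two numbers are only meaningful together. On a handful of held-out points the empirical coverage is itself noisy, which is worth remembering before declaring a model calibrated.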

This changes how I read success. Lower error matters, but only with coverage intact. A useful recommendation should improve the objective while shrinking uncertainty for the next round. A useful surrogate should be able to say “not yet” without embarrassment, because the cost of a bad confident answer is higher than the cost of another measured point.

The question I come back to is this: would I make the same decision if the error bars were printed as large as the prediction? If not, I was not using the model. I was using the model’s confidence costume. Error bars before confidence is less glamorous, but it is the difference between a fast guess and a scientific instrument.