Identity Tests, Not Leaderboards

2024-12-14

I like numbers that make me change my mind. Most leaderboards don’t. They rank, they glitter, they move a decimal place, and then everyone gets better at climbing the same wall. Useful evaluation, at least for me, starts with a different question: if I nudge a sensible knob, do you stay yourself? If the answer is no, the score was costume jewelry.

By “identity test” I mean a check tied to the structure of the problem, not the sample of the week. Monotonicity when the variable should only go one way. Conservation when the books must balance. Continuity when small changes shouldn’t flip a verdict. Commutation when the order of operations shouldn’t matter. Integrability when gradients should come from a potential. None of this is fancy. All of it is specific. An identity test is a sentence you can write before you see a dataset.
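
To make "a sentence you can write before you see a dataset" concrete, here is a minimal sketch of three of those checks (monotonicity, conservation, commutation) as plain Python predicates. Every name in it (`is_monotone`, `conserves_total`, `commutes`, and the toy `diffuse` and `leak` systems) is invented for illustration and the tolerances are arbitrary assumptions; it is one possible shape for such checks, not a reference implementation.

```python
import math

def is_monotone(f, xs, tol=1e-12):
    """True if f never decreases as x increases over the sample points xs."""
    ys = [f(x) for x in sorted(xs)]
    return all(b >= a - tol for a, b in zip(ys, ys[1:]))

def conserves_total(step, state, tol=1e-9):
    """True if one update step leaves the total of the state unchanged."""
    return math.isclose(sum(step(state)), sum(state), rel_tol=0.0, abs_tol=tol)

def commutes(f, g, x, tol=1e-9):
    """True if applying f then g matches applying g then f at the point x."""
    return math.isclose(g(f(x)), f(g(x)), rel_tol=0.0, abs_tol=tol)

# Toy systems, invented purely for illustration.
def diffuse(state, k=0.1):
    """Each cell trades a fraction k with its neighbours on a ring."""
    n = len(state)
    return [state[i] + k * (state[i - 1] - 2 * state[i] + state[(i + 1) % n])
            for i in range(n)]

def leak(state):
    """Loses 10% of every cell, so the books cannot balance."""
    return [0.9 * x for x in state]

grid = [i / 10 for i in range(-20, 21)]
results = {
    "tanh is monotone": is_monotone(math.tanh, grid),
    "x^2 is monotone": is_monotone(lambda x: x * x, grid),
    "diffuse conserves": conserves_total(diffuse, [1.0, 2.0, 3.0, 4.0]),
    "leak conserves": conserves_total(leak, [1.0, 2.0, 3.0, 4.0]),
    "scale/scale commute": commutes(lambda x: 2 * x, lambda x: 3 * x, 1.5),
    "scale/shift commute": commutes(lambda x: 2 * x, lambda x: x + 1, 1.5),
}

for name, ok in results.items():
    print(f"{name}: {ok}")
```

None of the three predicates mentions a dataset. Each one takes a system and a handful of probe points, which is what lets them be written first and run on anything later.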

The reason they matter is that they localize failure. A leaderboard delta tells you who is “better.” An identity violation tells you what to fix. If increasing temperature lowers entropy in your model, that’s not a vibe; it’s a bug with a name. If two equivalent paths produce different results, you don’t need a bigger dataset. You need a commutation check. These tests aren’t moral. They’re diagnostic, and they turn opinion into repair tickets.
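
The temperature example can be sketched in a few lines, assuming "temperature" means the usual softmax temperature, where logits are divided by T before normalizing. For T > 0 the entropy of that distribution should never fall as T rises. The `buggy_softmax` below is a deliberately broken stand-in (it multiplies by T instead of dividing), and `first_violation` is a hypothetical helper name, not part of any library.

```python
import math

def softmax(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def buggy_softmax(logits, temperature):
    """Deliberate bug: effectively multiplies by temperature instead of dividing."""
    return softmax(logits, 1.0 / temperature)

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def first_violation(model, logits, temperatures, tol=1e-12):
    """Return the first (T_low, T_high) pair where entropy drops as T rises, else None."""
    temps = sorted(temperatures)
    hs = [entropy(model(logits, t)) for t in temps]
    for i in range(len(temps) - 1):
        if hs[i + 1] < hs[i] - tol:
            return (temps[i], temps[i + 1])
    return None

logits = [2.0, 1.0, 0.1]
temps = [0.5, 1.0, 2.0, 4.0]
print("correct model:", first_violation(softmax, logits, temps))
print("buggy model:  ", first_violation(buggy_softmax, logits, temps))
```

The check returns a pair of temperatures rather than a score. That is the "bug with a name" part: the output says where the law broke, which is more than a leaderboard delta can say.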

I also like them because they scale to reality. You can embed them in CI as property-based tests that hunt for counterexamples on every run, and keep fixing until none turn up. You can use differential testing across solvers to find which one breaks first and why. You can track violation rate over time and watch it fall as a real signal of progress. I don’t use them to police elegance. I use them to keep models compatible with the world they claim to describe.
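
A rough sketch of the CI idea, using only the standard library: generate random states from a fixed seed, count how often a conservation invariant fails, and keep the shortest failing input as a crude counterexample. The functions `rebalance`, `rebalance_rounded`, `random_state`, and `violation_report` are hypothetical names, and the rounding bug is made up to give the check something to catch. A dedicated property-based testing library such as Hypothesis offers smarter input generation and real shrinking; this is only the bare idea.

```python
import math
import random

def rebalance(state, k=0.25):
    """Move a fraction k of each account to the next one on a ring."""
    n = len(state)
    return [state[i] - k * state[i] + k * state[i - 1] for i in range(n)]

def rebalance_rounded(state, k=0.25):
    """Same update, but rounds to cents afterwards, which leaks or mints money."""
    return [round(x, 2) for x in rebalance(state, k)]

def conserves_total(step, state, tol=1e-9):
    """True if one update step leaves the total of the state unchanged."""
    return math.isclose(sum(step(state)), sum(state), rel_tol=0.0, abs_tol=tol)

def random_state(rng):
    """Between 1 and 8 accounts with balances in [0, 100]."""
    return [rng.uniform(0.0, 100.0) for _ in range(rng.randint(1, 8))]

def violation_report(step, trials=500, seed=0):
    """Return (violation rate, shortest failing state or None) over seeded random cases."""
    rng = random.Random(seed)  # fixed seed so a CI failure can be replayed
    cases = [random_state(rng) for _ in range(trials)]
    failures = [s for s in cases if not conserves_total(step, s)]
    smallest = min(failures, key=len) if failures else None
    return len(failures) / trials, smallest

rate_ok, example_ok = violation_report(rebalance)
rate_bad, example_bad = violation_report(rebalance_rounded)
print("rebalance:        ", rate_ok, example_ok)
print("rebalance_rounded:", rate_bad, example_bad)
```

Logging the rate from each CI run gives the violation-rate curve, and the stored failing state is a ready-made regression case for the next run.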

There’s a common worry that identity tests will strangle exploration. In practice they do the opposite. Constraints prune dead branches early, which frees up attention for ideas that survive contact with them. If a candidate architecture can’t pass a conservation check at toy scale, it doesn’t deserve the weekend on the cluster. If it does, you learn faster because every result means the same thing on Monday that it meant on Friday.

This style of thinking changes the shape of a benchmark. I’ve become suspicious of wide tables and friendlier with small suites that fit on one screen: a fixed protocol, one task metric, three identity checks, and one minimal counterexample we keep around as a living reminder. When the suite turns red, we’re fixing the law that got broken, not arguing about taste. When it’s green, rank-ordering starts to mean something because the contenders are playing by the same rules.
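
One way such a one-screen suite could look, as a sketch under assumptions: the model is a scalar function, the fixed protocol is a fixed grid of inputs, the task metric is mean absolute error against `math.tanh`, and the three identities are monotonicity, odd symmetry, and continuity. The `Suite` class, the grid, the thresholds, and the pinned input `PINNED` are all invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[float], float]

GRID: List[float] = [i / 10 for i in range(-30, 31)]  # the fixed protocol
PINNED: float = -5e-7  # kept counterexample: an input just left of zero

def task_metric(model: Model) -> float:
    """Mean absolute error against tanh on the fixed grid."""
    return sum(abs(model(x) - math.tanh(x)) for x in GRID) / len(GRID)

def monotone(model: Model) -> bool:
    ys = [model(x) for x in GRID]
    return all(b >= a for a, b in zip(ys, ys[1:]))

def odd_symmetry(model: Model) -> bool:
    return all(abs(model(-x) + model(x)) <= 1e-9 for x in GRID)

def continuity(model: Model) -> bool:
    # The pinned input is probed alongside the grid so an old failure stays visible.
    return all(abs(model(x + 1e-6) - model(x)) <= 1e-3 for x in GRID + [PINNED])

@dataclass
class Suite:
    metric: Callable[[Model], float]
    identity_checks: Dict[str, Callable[[Model], bool]]

    def run(self, model: Model) -> dict:
        broken = [name for name, check in self.identity_checks.items()
                  if not check(model)]
        return {"metric": self.metric(model), "broken": broken, "green": not broken}

suite = Suite(task_metric, {
    "monotone": monotone,
    "odd_symmetry": odd_symmetry,
    "continuity": continuity,
})

def step_model(x: float) -> float:
    """A hard threshold: monotone, but it jumps at zero."""
    return 1.0 if x >= 0 else -1.0

print(suite.run(math.tanh))
print(suite.run(step_model))
```

In this toy setup the grid alone never straddles the jump at zero, so the continuity check only catches the threshold model because the pinned input is still in the probe set. That is the job of the living reminder.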

The piece I keep for myself is simple: don’t celebrate an accuracy bump until the identities pass in the same run. If the bump survives, it’s a result. If it doesn’t, it was noise masquerading as news. What I’m watching next is mundane on purpose: violation rate trending to zero under perturbations I can explain, and time-to-recovery shrinking when I kick the system. When those curves move the right way, I trust the leaderboard. Until then, it’s a scoreboard without a rulebook.
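
The rule and the two curves can be written down in a few lines. This is a sketch with hypothetical names (`accept_bump`, `time_to_recovery`) and made-up numbers, assuming each run reports an accuracy and a count or rate of identity violations.

```python
def accept_bump(baseline_acc, new_acc, violations, min_gain=0.0):
    """Count a gain as a result only if the same run had zero identity violations."""
    return (new_acc - baseline_acc) > min_gain and violations == 0

def time_to_recovery(rates, kick_index, threshold=0.0):
    """Runs elapsed after the kick until the violation rate is back at or
    below threshold; None if it never recovers in the recorded window."""
    for offset, rate in enumerate(rates[kick_index:]):
        if rate <= threshold:
            return offset
    return None

# Made-up history: the system is perturbed at run index 2.
violation_rates = [0.0, 0.0, 0.4, 0.2, 0.05, 0.0, 0.0]

print(accept_bump(0.81, 0.83, violations=0))
print(accept_bump(0.81, 0.83, violations=2))
print(time_to_recovery(violation_rates, kick_index=2))
```

The gate is deliberately blunt: the same accuracy gain is accepted or rejected depending only on whether the identities held in that run.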