From Benchmark to Bench

2026-03-18

I trust benchmarks, but only up to the point where the table starts pretending to be the world.

A benchmark is a useful compression. It fixes the task, the metric, the data, the reset condition. It lets people compare without spending half the argument defining the game. For software, that discipline is essential. Without it, every result becomes a weather report from a different planet. I like fixed protocols, trace logs, repeatable episodes, and tiny tests that fail loudly when a claim gets too comfortable.

But robots eventually leave the table.

The bench has a different personality. Batteries sag. Motors warm. Sensors drift. Cables pull. Floors are not quite flat. Resetting the experiment costs human time. Someone has to walk over and set the world right again. A policy that looks efficient in a trace can become clumsy when the cost of correction has weight, heat, and wear attached to it. The bench is where energy stops being a column and becomes a physical fact.

REB-H keeps pulling at me for this reason. I don’t want embodied systems that treat power as bookkeeping. I want robots that understand energy as part of the task envelope: how much was spent to perceive, decide, move, recover, and try again. Success without that ledger is incomplete. A robot that finishes the task by burning through its budget has not solved the same problem as one that finishes with margin.
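A minimal sketch of what that ledger could look like, assuming the bench logs timestamped power samples tagged with the phase the robot was in. The sample format, the phase names, and the budget figure are all hypothetical, invented here for illustration; nothing below comes from REB-H itself.

```python
from collections import defaultdict


def energy_ledger(samples):
    """Integrate power over time and book the energy per phase.

    samples: list of (t_seconds, power_watts, phase), sorted by time.
    Uses the trapezoid rule; each interval is charged to the phase
    of the sample that starts it. Returns joules per phase.
    """
    ledger = defaultdict(float)
    for (t0, p0, phase), (t1, p1, _) in zip(samples, samples[1:]):
        ledger[phase] += 0.5 * (p0 + p1) * (t1 - t0)
    return dict(ledger)


def margin(ledger, budget_joules):
    """Fraction of the energy budget left over (negative if overspent)."""
    return 1.0 - sum(ledger.values()) / budget_joules


# Made-up samples: one second perceiving, two moving, one recovering.
samples = [
    (0.0, 10.0, "perceive"),
    (1.0, 10.0, "move"),
    (3.0, 30.0, "recover"),
    (4.0, 30.0, "recover"),
]
ledger = energy_ledger(samples)
print(ledger)                 # joules per phase
print(margin(ledger, 100.0))  # share of a hypothetical 100 J budget remaining
```

Two runs with the same success flag can then be told apart by where the joules went and how much margin was left.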

Software traces are still useful. They show latency, action frequency, policy switches, inference cost, and failure timing. They help separate bad perception from bad planning and bad planning from bad control. They let a run become inspectable instead of anecdotal. But a trace is a shadow. It does not feel the actuator heat. It does not pay for a skid. It does not care that a recovery maneuver was technically successful but physically stupid.

The bench adds missing units.

I like thinking in ratios here: task success per watt-hour, recovery quality per joule, useful work per intervention, stability per degree of thermal rise. None of these is perfect, but they make it harder for the robot to win a software game while losing the physical one. If a policy improves success rate by five percent while doubling energy and tripling recovery burden, the result should look suspicious before it looks impressive.
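The suspicious case above can be written down as a small check. This is a sketch under stated assumptions: the `RunSummary` fields, the use of human interventions as a stand-in for recovery burden, and every number are hypothetical, and "five percent" is read as a relative gain (80 successes to 84).

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical per-policy summary; field names are illustrative."""
    episodes: int
    successes: int
    energy_wh: float      # total energy drawn across all episodes
    interventions: int    # human resets or manual corrections

    @property
    def success_rate(self) -> float:
        return self.successes / self.episodes

    @property
    def successes_per_wh(self) -> float:
        return self.successes / self.energy_wh

    @property
    def successes_per_intervention(self) -> float:
        # No interventions at all counts as infinitely good, not an error.
        if self.interventions == 0:
            return float("inf")
        return self.successes / self.interventions


def looks_suspicious(baseline: RunSummary, candidate: RunSummary) -> bool:
    """True when the candidate wins on success rate but loses both ratios."""
    return (
        candidate.success_rate > baseline.success_rate
        and candidate.successes_per_wh < baseline.successes_per_wh
        and candidate.successes_per_intervention
        < baseline.successes_per_intervention
    )


# Made-up numbers: +5% successes, 2x energy, 3x interventions.
baseline = RunSummary(episodes=100, successes=80, energy_wh=40.0, interventions=10)
candidate = RunSummary(episodes=100, successes=84, energy_wh=80.0, interventions=30)

print(baseline.successes_per_wh, candidate.successes_per_wh)
print(looks_suspicious(baseline, candidate))
```

With these numbers the candidate roughly halves successes per watt-hour, so the check flags it even though its success rate is higher.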

This also changes what “learning” means. In a benchmark, learning often means the curve goes up. On a bench, I also want the system to become less wasteful, less surprised, and less dependent on perfect resets. A good robot spends less energy on avoidable correction over time. It notices when uncertainty makes motion expensive. It chooses a slower action when speed is just a louder form of ignorance.
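One hedged way to put a number on "less wasteful over time" is to track what share of each session's energy went to recovery and fit a line through it. The session format and the figures below are assumptions for illustration, and a least-squares slope is only one of several reasonable trend measures.

```python
def recovery_fraction_trend(sessions):
    """Least-squares slope of the recovery-energy fraction per session.

    sessions: list of (total_joules, recovery_joules), oldest first.
    A negative slope means a shrinking share of energy is spent on recovery.
    """
    if len(sessions) < 2:
        raise ValueError("need at least two sessions to fit a trend")
    fractions = [recovery / total for total, recovery in sessions]
    n = len(fractions)
    mean_x = (n - 1) / 2          # mean of the session indices 0..n-1
    mean_y = sum(fractions) / n
    num = sum((i - mean_x) * (f - mean_y) for i, f in enumerate(fractions))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Made-up sessions: same total energy, less of it spent on recovery each time.
sessions = [(100.0, 40.0), (100.0, 30.0), (100.0, 20.0)]
print(recovery_fraction_trend(sessions))  # negative: the trend is downward
```

A success curve that climbs while this slope stays flat or positive would be learning in the benchmark sense only.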

The most interesting failures are the ones that software alone would have called edge cases. A sensor mount loosens. A wheel slips under a load it handled yesterday. A battery near the end of its discharge curve changes the controller’s effective personality. These are not annoyances around the real problem. They are the real problem arriving in its native units.

So I don’t see the move from benchmark to bench as a rejection of software evaluation. It is a way to finish the argument. Benchmarks give language. Benches give consequence. The right system has to survive both: clean enough to compare, physical enough to matter, and honest enough to report what it spent to get the answer.

I’ll trust the robot more when its best run is the one whose trace and power ledger tell the same story.