Battle Mode

Head-to-head comparison of 2 to 4 models. Scores are shown side by side, benchmark by benchmark, with the winner marked with ★ and the spread between the top and bottom scores. Shareable URL: paste a link and the recipient sees exactly the same battle.

How it works

Higher is better on every benchmark (percentages and Elo ratings).
Comparable: only benchmarks where 2+ models have a published score count toward wins/losses.
Exact tie: if two models share the same score, each is counted as a win plus a tie.
N/A: when a model has no score for a benchmark, it is marked as abstained and that benchmark does not affect its win rate.
Spread: the difference between the highest and lowest score on the benchmark. It hints at whether the gap is significant or marginal.
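
The rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual implementation: the names run_battle and win_rate, the shape of the input, and the example scores are all hypothetical. It also assumes that a tie only matters when it is for the top score, that abstentions are recorded only on comparable benchmarks, and that the win rate is wins divided by wins plus losses. None of these is stated in the rules themselves.

```python
from typing import Dict, List, Optional

# Hypothetical input shape: benchmark -> model slug -> score (None means N/A).
Scores = Dict[str, Dict[str, Optional[float]]]


def run_battle(scores: Scores) -> dict:
    """Apply the battle rules and return per-benchmark rows plus a per-model tally."""
    # Collect model slugs in first-seen order.
    models: List[str] = list(dict.fromkeys(m for row in scores.values() for m in row))
    tally = {m: {"wins": 0, "ties": 0, "losses": 0, "abstained": 0} for m in models}
    rows = {}

    for bench, row in scores.items():
        present = {m: s for m, s in row.items() if s is not None}
        if len(present) < 2:
            continue  # Comparable rule: fewer than 2 published scores, skip entirely.

        top, bottom = max(present.values()), min(present.values())
        winners = [m for m, s in present.items() if s == top]  # higher is better

        for m in models:
            if m not in present:
                tally[m]["abstained"] += 1  # N/A rule: no effect on wins or losses
            elif m in winners:
                tally[m]["wins"] += 1
                if len(winners) > 1:
                    tally[m]["ties"] += 1  # Exact tie rule: win plus tie
            else:
                tally[m]["losses"] += 1

        rows[bench] = {"winners": winners, "spread": top - bottom}

    return {"rows": rows, "tally": tally}


def win_rate(entry: Dict[str, int]) -> Optional[float]:
    """Assumed definition: wins / (wins + losses); abstentions are ignored."""
    decided = entry["wins"] + entry["losses"]
    return entry["wins"] / decided if decided else None


# Made-up scores for three made-up models.
example_scores: Scores = {
    "bench_a": {"alpha": 80.0, "beta": 75.5, "gamma": None},
    "bench_b": {"alpha": 1200.0, "beta": 1200.0, "gamma": 1150.0},
    "bench_c": {"alpha": None, "beta": 60.0, "gamma": None},
}
result = run_battle(example_scores)
```

In this made-up example, alpha wins bench_a outright, alpha and beta share bench_b (each gets a win plus a tie), gamma abstains on bench_a, and bench_c is skipped because only one model has a score.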

The URL holds the model slugs (?models=a,b,c). There is no server: opening a shared link loads the data from the catalog at that point in time.
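
As a rough illustration of the ?models=a,b,c format, the sketch below builds and parses such a URL with Python's standard urllib.parse. The function names, the example.com base address, and the decision to reject links outside the 2 to 4 model range are assumptions made for the example, not a description of how the app handles links.

```python
from typing import List
from urllib.parse import parse_qs, urlencode, urlsplit

MIN_MODELS, MAX_MODELS = 2, 4  # a battle compares 2 to 4 models


def build_battle_url(base: str, slugs: List[str]) -> str:
    """Return base?models=a,b,c, keeping the commas unescaped for readability."""
    return f"{base}?{urlencode({'models': ','.join(slugs)}, safe=',')}"


def parse_battle_url(url: str) -> List[str]:
    """Extract the model slugs from the 'models' query parameter."""
    query = parse_qs(urlsplit(url).query)
    raw = query.get("models", [""])[0]
    slugs = [s.strip() for s in raw.split(",") if s.strip()]
    # Assumption: links with too few or too many models are rejected.
    if not MIN_MODELS <= len(slugs) <= MAX_MODELS:
        raise ValueError(f"expected {MIN_MODELS}-{MAX_MODELS} models, got {len(slugs)}")
    return slugs


# Hypothetical base address and slugs.
url = build_battle_url("https://example.com/battle", ["a", "b", "c"])
slugs = parse_battle_url(url)
```

Because the link carries only slugs and no scores, whatever opens it has to look the scores up again in the catalog.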