Methodology

How we collect, verify and publish data. 111 models, 31 benchmarks, 25 companies.

Sources

Scores published in this atlas come exclusively from verifiable primary and secondary sources:

Official papers and system cards from labs (OpenAI, Anthropic, Google DeepMind, etc.)
Technical blogs and model cards at launch
Hugging Face leaderboards (when applicable)
Verified third-party technical reports
LMArena, formerly LMSYS Chatbot Arena (Arena-ELO snapshots with explicit dates)

Note: we do NOT store the AA-Index (Artificial Analysis Intelligence Index) in the catalog, because its scale changes between AA methodology versions (v3 → v4) and we cannot audit the difference. Instead we publish our own Frontier Index, which is reproducible and documented.

Frontier Index

Reproducible composite derived only from atomic model benchmarks (measured once by the provider, with no scale that drifts over time).

Algorithm: weighted Percentile Rank

for each benchmark b reported by model m:
  pct_b = (#models_with_score_<=_score_of_m_in_b / total_models_with_b) * 100

FrontierIndex(m) = (Σ pct_b * weight_b) / (Σ weight_b)
                 * coveragePenalty(coverage)

coverage(m)      = Σ weight_b / total_weight  (b ∈ benchmarks_of_m)
coveragePenalty  = 0.4 + 0.6 * coverage

Why percentile rank and not a simple weighted average: Different benchmarks have different natural ceilings. A score of 45 on HumanEval (ceiling near 95) is not equivalent to 45 on FrontierMath (ceiling near 50). Percentile rank neutralizes that difference: scoring at or above 80% of the field means the same thing regardless of the benchmark.
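The formulas above can be sketched in TypeScript roughly as follows. This is a minimal illustration, not the contents of frontierIndex.ts: the type and function names, the benchmark ids (benchA, benchB) and the three example models are hypothetical, and the sketch assumes the model being scored is itself part of the catalog.

```typescript
// Hypothetical shape: benchmark id -> number (a raw score or an editorial weight).
type Scores = Record<string, number>;

// Share of models scoring at or below `score` on one benchmark, on a 0-100 scale.
function percentileRank(score: number, field: number[]): number {
  const atOrBelow = field.filter((s) => s <= score).length;
  return (atOrBelow / field.length) * 100;
}

// Weighted percentile rank times the coverage penalty 0.4 + 0.6 * coverage.
// Assumes `catalog` contains `model`, so every field has at least one entry.
function frontierIndex(model: Scores, catalog: Scores[], weights: Scores): number {
  let totalWeight = 0;
  for (const b of Object.keys(weights)) totalWeight += weights[b];

  let weightedSum = 0;
  let coveredWeight = 0;
  for (const b of Object.keys(model)) {
    if (!(b in weights)) continue; // benchmarks without a weight are ignored
    const field = catalog.filter((m) => b in m).map((m) => m[b]);
    weightedSum += percentileRank(model[b], field) * weights[b];
    coveredWeight += weights[b];
  }
  if (coveredWeight === 0) return 0;

  const coverage = coveredWeight / totalWeight;
  return (weightedSum / coveredWeight) * (0.4 + 0.6 * coverage);
}

// Made-up example data, for illustration only.
const exampleWeights: Scores = { benchA: 0.6, benchB: 0.4 };
const flagship: Scores = { benchA: 90, benchB: 50 };
const midTier: Scores = { benchA: 80, benchB: 40 };
const sparse: Scores = { benchA: 70 }; // reports a single benchmark
const exampleCatalog: Scores[] = [flagship, midTier, sparse];
```

In this toy catalog the flagship scores 100, the mid-tier model about 60, and the sparse model about 25.3: its percentile on benchA is 33.3, and a coverage of 0.6 scales that by 0.76.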

Empirical validation

We implemented 5 candidate algorithms (weighted-avg, z-score, percentile-rank, min-max, hybrid) and compared them against 3 external reference rankings (Artificial Analysis v4, LMArena Text Overall, llm-stats Score) using Spearman rank correlation. Results (snapshot 2026-05-04):

Algorithm	AA-v4	LMArena	llm-stats	AVG
weighted-avg	0.093	0.189	0.111	0.131
z-score	0.174	0.348	0.314	0.279
min-max	0.210	0.456	0.339	0.335
hybrid	0.326	0.502	0.371	0.400
percentile-rank ✓	0.477	0.480	0.521	0.493

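Percentile rank has the highest average rho: 0.493 (a moderate correlation with the external rankings), ahead of hybrid at 0.400 and far ahead of the naive weighted-avg at 0.131 (basically noise). It leads on AA-v4 and llm-stats; on LMArena, hybrid is slightly ahead (0.502 vs 0.480). The empirical test lives at packages/core/src/scoring/empirical.test.ts and runs in CI.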
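For readers unfamiliar with the metric, Spearman's rho is the Pearson correlation of the two rank vectors. The sketch below shows one common way to compute it, with tied values receiving the average of their ranks. It is an illustration under those assumptions, not the code in empirical.test.ts, and the function names are hypothetical. It expects both arrays to list scores for the same models in the same order.

```typescript
// Assign 1-based ranks; tied values share the average of the ranks they span.
function averageRanks(values: number[]): number[] {
  const order = values.map((v, i) => ({ v, i })).sort((a, b) => a.v - b.v);
  const ranks = new Array<number>(values.length);
  let start = 0;
  while (start < order.length) {
    let end = start;
    while (end + 1 < order.length && order[end + 1].v === order[start].v) end++;
    const avg = (start + end) / 2 + 1;
    for (let k = start; k <= end; k++) ranks[order[k].i] = avg;
    start = end + 1;
  }
  return ranks;
}

// Plain Pearson correlation; returns NaN if either input has zero variance.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  let mx = 0;
  let my = 0;
  for (let i = 0; i < n; i++) {
    mx += x[i] / n;
    my += y[i] / n;
  }
  let cov = 0;
  let vx = 0;
  let vy = 0;
  for (let i = 0; i < n; i++) {
    cov += (x[i] - mx) * (y[i] - my);
    vx += (x[i] - mx) * (x[i] - mx);
    vy += (y[i] - my) * (y[i] - my);
  }
  return cov / Math.sqrt(vx * vy);
}

// Spearman rank correlation between two score lists for the same models.
function spearman(x: number[], y: number[]): number {
  if (x.length !== y.length || x.length < 2) {
    throw new Error("expected two equal-length arrays with at least 2 items");
  }
  return pearson(averageRanks(x), averageRanks(y));
}
```

Identical orderings give 1 and fully reversed orderings give -1. Swapping two adjacent pairs in a list of five, as in spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]), gives 0.8.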

Editorial weights

Weights per category (sum ≈ 1.0): Reasoning ~0.30 (GPQA-Diamond dominates with 0.18), Coding ~0.25 (SWE-bench-Verified at 0.12), Math ~0.20 (AIME-2025 at 0.10), Knowledge / Hard reasoning ~0.10, Agentic ~0.10, Instruction ~0.05.

Coverage penalty

Coverage indicates how much of the total weight was covered by the benchmarks the model DID report. Without a penalty, a model with ONE cherry-picked benchmark (e.g. only GPQA-Diamond=99) could outrank a flagship with full scores. We apply the factor `0.4 + 0.6 × coverage` to the base score: at coverage=1.0 it does not penalize; at coverage=0.1 it multiplies by 0.46.
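A short worked example of the penalty, assuming the total weight is exactly 1.0 (the text gives it only as approximately 1.0) and using the 0.18 weight quoted for GPQA-Diamond. The function name and the flagship's base score of 75 are hypothetical values chosen for illustration.

```typescript
// Apply the coverage penalty 0.4 + 0.6 * coverage to a base score on a 0-100 scale.
function penalizedScore(base: number, coverage: number): number {
  return base * (0.4 + 0.6 * coverage);
}

// A model that reports only GPQA-Diamond and tops that benchmark:
// base = 100, coverage = 0.18, factor = 0.4 + 0.6 * 0.18 = 0.508.
const cherryPicked = penalizedScore(100, 0.18); // about 50.8

// A hypothetical flagship with every benchmark reported and a base of 75:
// coverage = 1.0, factor = 1.0, so the score is unchanged.
const fullCoverage = penalizedScore(75, 1.0); // 75
```

Under these assumptions the single-benchmark model drops from 100 to about 50.8 and ranks below the flagship at 75.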

The formula lives in packages/core/src/scoring/frontierIndex.ts. Because percentile ranks are computed against the current field, the scores "recalibrate" automatically when providers report new benchmark results, without touching the algorithm and without hand-adjusting any individual score.

Benchmark taxonomy

The 31 benchmarks are organized into 8 categories: Reasoning, Coding, Math, Knowledge, Instruction, Multilingual, Agentic and General. The taxonomy is opinionated but clear: each benchmark lives in a single category.

What we do NOT do

We do NOT run models. We mirror what verified sources report.
We do NOT use synthetic data or estimates that are not published.
We do NOT receive payment to include or feature models.
We do NOT have sponsored rankings.
We do NOT offer a public API: the only way to consume the data is this site.

Update policy

Data is reviewed at the launch of every frontier model. When a new model launches with official scores, we add it. When a reported source is corrected or amended, we update. When a score is identified as contaminated by training data, we flag it.

Each significant change is recorded in the changelog with its date and reason.

Hardware estimation

The Hardware Checker estimates whether a model fits in your GPU using an explicit formula:

VRAM = params × bytes_per_param + KV_cache(context) + overhead

Full detail (bytes per param for each quantization, KV cache scale, MoE caveat, Apple unified memory) is on the Hardware Checker page. The estimate is best-effort: real numbers can vary by 5-15% depending on framework, batch size and KV-cache compression.
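As a rough illustration of the formula, the sketch below plugs in the standard KV cache size for a single sequence: 2 (keys and values) × layers × KV heads × head dimension × context tokens × bytes per value. This is an assumption for the example, not necessarily the exact scaling the Hardware Checker uses, and the interface, function names, flat overhead allowance and example model dimensions are all hypothetical.

```typescript
// Hypothetical input shape for the estimate.
interface VramInput {
  params: number;          // total parameter count
  bytesPerParam: number;   // e.g. 2 for FP16 weights
  layers: number;          // number of transformer layers
  kvHeads: number;         // number of key/value heads
  headDim: number;         // dimension of each head
  contextTokens: number;   // context length to reserve cache for
  kvBytesPerValue: number; // e.g. 2 for an FP16 KV cache
  overheadGiB: number;     // assumed flat allowance for runtime buffers
}

const GIB = 1024 * 1024 * 1024;

// KV cache for one sequence: keys and values for every layer, head and token.
function kvCacheBytes(i: VramInput): number {
  return 2 * i.layers * i.kvHeads * i.headDim * i.contextTokens * i.kvBytesPerValue;
}

// VRAM = params × bytes_per_param + KV_cache(context) + overhead, in GiB.
function estimateVramGiB(i: VramInput): number {
  const weightBytes = i.params * i.bytesPerParam;
  return (weightBytes + kvCacheBytes(i)) / GIB + i.overheadGiB;
}

function fitsInGpu(i: VramInput, gpuGiB: number): boolean {
  return estimateVramGiB(i) <= gpuGiB;
}

// Made-up 8B-parameter model at FP16 with an 8192-token context.
const exampleModel: VramInput = {
  params: 8e9,
  bytesPerParam: 2,
  layers: 32,
  kvHeads: 8,
  headDim: 128,
  contextTokens: 8192,
  kvBytesPerValue: 2,
  overheadGiB: 1,
};
```

For this example the weights take about 14.9 GiB, the KV cache exactly 1 GiB, and the assumed overhead 1 GiB, for roughly 16.9 GiB in total. That fits a 24 GiB card but not a 16 GiB one, before allowing for the 5-15% variation mentioned above.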

Editorial tone

Numbers without context are numbers without meaning. When a model launches with a record score, we try to explain why it matters, what methodology the benchmark uses, and whether there are caveats (contamination, benchmark version, evaluation conditions). We prefer honesty over marketing — even when that means saying we do not know.