Benchmarks

31 frontier benchmarks across 8 categories.

Reasoning (7)

GPQA-Diamond

PhD-level multiple-choice questions in physics, chemistry, and biology.

54 models with score

MMLU-Pro

MMLU upgraded with harder questions and 10 answer options.

32 models with score

Humanitys-Last-Exam

Expert-written questions at the frontier of academic knowledge, designed to be extremely hard for current models.

20 models with score

MMLU

Massive Multitask Language Understanding — 57 academic subjects, ~16K questions.

18 models with score

MMMU

Massive Multi-discipline Multimodal Understanding: multimodal reasoning over academic images.

11 models with score

ARC-AGI-2

Updated ARC challenge — novel and tough abstract reasoning.

8 models with score

BBH

BIG-Bench Hard — 23 tasks that require multi-step reasoning.

1 model with score

Coding (9)

SWE-bench-Verified

Real GitHub issues from 12 popular Python repos.

41 models with score

LiveCodeBench

Coding contest problems collected continuously from LeetCode, AtCoder, and Codeforces.

30 models with score

Terminal-Bench-2

Terminal-Bench v2: agentic tasks in a command-line environment.

17 models with score

SWE-bench-Pro

Professional version of SWE-bench with more complex issues.

15 models with score

HumanEval

Functional correctness on 164 Python coding problems.

10 models with score

Aider-polyglot

Code editing benchmark across multiple languages.

6 models with score

MBPP+

Mostly Basic Python Problems with rigorous additional tests.

1 model with score

CyberGym

Vulnerability reproduction benchmark: models must reproduce real CVEs.

1 model with score

Terminal-Bench-Hard

Hard terminal/CLI tasks.

1 model with score

Math (5)

AIME-2025

American Invitational Mathematics Examination 2025.

34 models with score

MATH-500

Competition math problems (500-problem set).

11 models with score

AIME-2024

American Invitational Mathematics Examination 2024.

5 models with score

GSM8K

Grade School Math 8K: roughly 8,500 grade-school word problems.

2 models with score

FrontierMath

Research-level mathematical problems.

2 models with score

Knowledge (1)

SimpleQA

Short-answer factuality benchmark.

6 models with score

Instruction (2)

IFEval

Instruction Following Evaluation — precision in following instructions.

9 models with score

Arena-Hard

Hard prompts from the Arena — 500 challenging tasks.

2 models with score

Multilingual (1)

MGSM

Multilingual Grade School Math.

6 models with score

Agentic (4)

OSWorld

Computer use benchmark — real desktop tasks.

9 models with score

BrowseComp

Browsing Competition: hard-to-find facts that require persistent web browsing to locate.

6 models with score

GDPval

Economically valuable, real-world work tasks drawn from professional occupations.

4 models with score

TAU-bench

Tool-Agent-User benchmark: airline and retail customer-service tasks.

1 model with score

General (2)

LiveBench

Contamination-free benchmark with monthly updates.

1 model with score

Arena-ELO

LMSYS Chatbot Arena Elo rating derived from human preference votes (typical range ~1000-1600).

0 models with score
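
For readers unfamiliar with the Elo scale used by the last entry, here is a minimal sketch of the classic Elo model. It shows how a rating gap maps to an expected win rate and how one pairwise vote moves two ratings. The function names and the K-factor of 32 are illustrative assumptions, and this is not necessarily the exact procedure the arena uses to compute its published ratings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.5 for a tie, 0.0 if B was preferred.
    k (assumed here to be 32) controls how far a single vote moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # A 400-point gap corresponds to roughly a 91% expected preference rate.
    print(round(expected_score(1600, 1200), 3))
    # Two equally rated models; A wins one vote.
    print(update(1200, 1200, 1.0))
```

Under this model, the ~1000-1600 range quoted above means the top-rated model would be expected to win the large majority of head-to-head votes against the lowest-rated one.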