Benchmarks
31 frontier benchmarks across 8 categories.
Reasoning (7)
PhD-level questions in graduate physics, chemistry, and biology.
MMLU upgraded with harder questions and 10 answer options.
Novel academic problems, among the hardest known benchmarks.
Massive Multitask Language Understanding — 57 academic subjects, ~16K questions.
Massive Multi-discipline Multimodal Understanding, covering reasoning over academic images.
Updated ARC challenge with novel, harder abstract-reasoning tasks.
BIG-Bench Hard — 23 tasks that require multi-step reasoning.
Coding (9)
Real GitHub issues from 12 popular Python repos.
Continuously updated coding contest problems from LeetCode and Codeforces.
Terminal Bench v2 — agentic tasks in CLI.
Professional version of SWE-bench with more complex issues.
Functional correctness on 164 Python coding problems.
Code editing benchmark across multiple languages.
Mostly Basic Python Problems with rigorous additional tests.
Vulnerability reproduction benchmark in which models reproduce real CVEs.
Hard terminal/CLI tasks.
Math (5)
American Invitational Mathematics Examination 2025.
Competition math problems (500-problem set).
American Invitational Mathematics Examination 2024.
Grade School Math 8K, roughly 8,500 grade-school math word problems.
Research-level mathematical problems.
Knowledge (1)
Instruction (2)
Multilingual (1)
Agentic (4)
Computer use benchmark — real desktop tasks.
Comprehensive web-browsing benchmark.
Economically valuable, real-world productivity tasks.
Tool agent benchmark — airline/retail customer service.