Reasoning

Humanity's Last Exam

The hardest known benchmark: novel academic problems.

20 models with a published score
Rank  Model                  Company          Score
1     Muse Spark             Meta             58.0
2     Claude Mythos Preview  Anthropic        56.8
3     Claude Opus 4.7        Anthropic        54.7
4     Grok 4 Heavy           xAI              50.7
5     GLM-5                  Zhipu AI         50.4
6     Kimi K2.5              Moonshot AI      50.2
7     Gemini 3 Deep Think    Google DeepMind  48.4
8     Gemini 3.1 Pro         Google DeepMind  44.4
9     GLM-4.7                Zhipu AI         42.8
10    Claude Opus 4.6        Anthropic        40.0
11    DeepSeek V4 Pro        DeepSeek         37.7
12    Gemini 3 Pro           Google DeepMind  37.5
13    Kimi K2.6              Moonshot AI      34.7
14    Gemini 3 Flash         Google DeepMind  33.7
15    MiMo V2 Pro            Xiaomi           28.3
16    Grok 4                 xAI              25.4
17    MiMo V2 Flash          Xiaomi           20.0
18    Nemotron 3 Super       Nvidia           17.4
19    Gemini 3.1 Flash-Lite  Google DeepMind  16.0
20    K-EXAONE 236B-A23B     LG AI Research   13.6