Reasoning

Humanity's Last Exam

The hardest known benchmark, built from novel academic problems.

20 models with a published score

#	Model	Company	Score
1	Muse Spark	Meta	58.0
2	Claude Mythos Preview	Anthropic	56.8
3	Claude Opus 4.7	Anthropic	54.7
4	Grok 4 Heavy	xAI	50.7
5	GLM-5	Zhipu AI	50.4
6	Kimi K2.5	Moonshot AI	50.2
7	Gemini 3 Deep Think	Google DeepMind	48.4
8	Gemini 3.1 Pro	Google DeepMind	44.4
9	GLM-4.7	Zhipu AI	42.8
10	Claude Opus 4.6	Anthropic	40.0
11	DeepSeek V4 Pro	DeepSeek	37.7
12	Gemini 3 Pro	Google DeepMind	37.5
13	Kimi K2.6	Moonshot AI	34.7
14	Gemini 3 Flash	Google DeepMind	33.7
15	MiMo V2 Pro	Xiaomi	28.3
16	Grok 4	xAI	25.4
17	MiMo V2 Flash	Xiaomi	20.0
18	Nemotron 3 Super	Nvidia	17.4
19	Gemini 3.1 Flash-Lite	Google DeepMind	16.0
20	K-EXAONE 236B-A23B	LG AI Research	13.6