Reasoning

MMLU-Pro

A harder successor to MMLU, with more difficult questions and 10 answer options instead of the original 4.
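The move to 10 answer options lowers the score a model could reach by guessing alone, which helps when reading the numbers in the table. A minimal sketch of that arithmetic follows. It assumes uniform random guessing and that every question has exactly 10 options (4 for the original MMLU). The helper name `chance_accuracy` is hypothetical and is not part of any benchmark tooling.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy (%) of uniform random guessing on one multiple-choice question."""
    if num_options < 1:
        raise ValueError("a question needs at least one answer option")
    return 100.0 / num_options


# Assumption: 4 options per question in the original MMLU, 10 in MMLU-Pro.
mmlu_chance = chance_accuracy(4)
mmlu_pro_chance = chance_accuracy(10)

print(f"MMLU chance baseline:     {mmlu_chance:.1f}%")
print(f"MMLU-Pro chance baseline: {mmlu_pro_chance:.1f}%")
```

Under these assumptions the guessing floor drops from 25% to 10%, so a given score on this benchmark sits further above chance than the same score on a 4-option test.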

32 models have a published score.

Rank	Model	Company	Score (%)
1	Claude Opus 4.7	Anthropic	91.5
2	Claude Opus 4.6	Anthropic	90.5
3	Claude Opus 4.5	Anthropic	90.0
4	Gemini 3 Pro	Google DeepMind	89.8
5	Doubao Seed 2.0 Lite	ByteDance	87.7
6	DeepSeek V4 Pro	DeepSeek	87.5
7	Kimi K2.5	Moonshot AI	87.1
8	Grok 4	xAI	87.0
9	Doubao Seed 2.0 Pro	ByteDance	87.0
10	Qwen3-Max-Thinking	Alibaba	85.7
11	Gemma 4 (31B dense)	Google DeepMind	85.2
12	Qwen3.6-35B-A3B	Alibaba	85.2
13	DeepSeek V3.2	DeepSeek	85.0
14	K-EXAONE 236B-A23B	LG AI Research	83.8
15	Nemotron 3 Super	Nvidia	83.3
16	EXAONE 4.5 33B	LG AI Research	83.3
17	Gemma 4 26B-A4B	Google DeepMind	82.6
18	Llama 4 Behemoth	Meta	82.2
19	Nova 2 Lite	Amazon	80.9
20	Llama 4 Maverick	Meta	80.5
21	Nemotron 3 Nano	Nvidia	78.3
22	Mistral Small 4	Mistral AI	78.0
23	Mistral Large 3	Mistral AI	78.0
24	GPT-5.2	OpenAI	75.4
25	Llama 4 Scout	Meta	74.3
26	MiniMax M2.5	MiniMax	74.0
27	Nova Premier	Amazon	73.3
28	Gemma 4 E4B	Google DeepMind	69.4
29	Reka Flash 3.1	Reka	66.9
30	Reka Flash 3	Reka	65.0
31	Gemma 4 E2B	Google DeepMind	60.0
32	Jamba 1.7 Large	AI21 Labs	57.7