Reasoning
MMLU
Massive Multitask Language Understanding - 57 academic subjects, ~16K questions.
18 models have published a score.
| # | Model | Company | Score |
|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | 92.1 |
| 2 | Gemini 3 Pro | Google DeepMind | 91.8 |
| 3 | GPT-5.2 | OpenAI | 91.4 |
| 4 | Gemini 3.1 Pro | Google DeepMind | 91.4 |
| 5 | DeepSeek R1 0528 | DeepSeek | 90.8 |
| 6 | Nova Premier | Amazon | 87.4 |
| 7 | Grok 4 | xAI | 86.6 |
| 8 | Nemotron 3 Super | Nvidia | 86.0 |
| 9 | Nova Pro | Amazon | 85.9 |
| 10 | Llama 4 Maverick | Meta | 85.5 |
| 11 | Mistral Large 3 | Mistral AI | 85.5 |
| 12 | Command A | Cohere | 85.5 |
| 13 | MiniMax M2.5 | MiniMax | 82.0 |
| 14 | Nova Lite | Amazon | 80.5 |
| 15 | AFM Server | Apple | 80.0 |
| 16 | Llama 4 Scout | Meta | 79.6 |
| 17 | Yi-Lightning | 01.AI | 76.0 |
| 18 | AFM On-Device | Apple | 67.9 |