Coding
HumanEval
Functional correctness on 164 Python coding problems.
10 models have published a score.
| # | Model | Company | Score (%) |
|---|---|---|---|
| 1 | Qwen3.5-Omni-Plus | Alibaba | 92.6 |
| 2 | Mistral Large 3 | Mistral AI | 92.0 |
| 3 | MiniMax M2.5 | MiniMax | 89.6 |
| 4 | Nova Pro | Amazon | 89.0 |
| 5 | Codestral 25.08 | Mistral AI | 86.6 |
| 6 | Llama 4 Maverick | Meta | 86.4 |
| 7 | Nova Lite | Amazon | 85.4 |
| 8 | Yi-Lightning | 01.AI | 83.5 |
| 9 | Nemotron 3 Super | Nvidia | 79.4 |
| 10 | DeepSeek V4 Pro | DeepSeek | 76.8 |
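
For readers unfamiliar with the metric, the following is a minimal sketch of what "functional correctness" means here: a generated solution counts as correct only if it passes the problem's unit tests when executed. The helper names `passes` and `pass_rate`, and the toy `add` problem, are hypothetical illustrations and are not taken from the benchmark. The leaderboard above does not state how each score was computed (for example, how many samples were drawn per problem), so this sketch assumes one attempt per problem. A real harness would also run untrusted code in a sandbox with a timeout, which is omitted here.

```python
def passes(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code passes every assertion in test_src."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the unit tests against it
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False
    return True


def pass_rate(results: list[bool]) -> float:
    """Percentage of problems solved, rounded to one decimal place."""
    return round(100.0 * sum(results) / len(results), 1)


# Hypothetical toy problem with one correct and one incorrect candidate.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"

results = [passes(good, tests), passes(bad, tests)]
print(pass_rate(results))
```

Under this assumption, a score is the share of the 164 problems for which the model's solution passes all tests.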