Coding

HumanEval

Functional correctness on 164 Python coding problems: a generated solution counts as correct only if it passes the problem's unit tests.
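As a rough illustration of what "functional correctness" means here, the sketch below runs a candidate solution against its tests and reports the share of problems solved. It is a minimal sketch and not the benchmark's own harness: the helper names `passes_tests` and `score` and the two toy problems are hypothetical, and it assumes the score is the percentage of problems whose tests pass.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion in test_src holds."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
    except Exception:
        # Any error (syntax error, crash, failed assertion) counts as a failure.
        return False
    return True


def score(results: list[bool]) -> float:
    """Percentage of problems solved, rounded to one decimal place."""
    return round(100 * sum(results) / len(results), 1)


# Two toy problems (hypothetical, not taken from the benchmark).
problems = [
    ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n"),
    ("def is_even(n):\n    return n % 2 == 1\n", "assert is_even(4)\n"),  # buggy on purpose
]

results = [passes_tests(src, tests) for src, tests in problems]
print(results, score(results))  # [True, False] 50.0
```

A real evaluation would run model-generated code in a sandbox with time limits, since calling `exec` on untrusted code directly is unsafe.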

10 models have a published score
#   Model              Company     Score
1   Qwen3.5-Omni-Plus  Alibaba     92.6
2   Mistral Large 3    Mistral AI  92.0
3   MiniMax M2.5       MiniMax     89.6
4   Nova Pro           Amazon      89.0
5   Codestral 25.08    Mistral AI  86.6
6   Llama 4 Maverick   Meta        86.4
7   Nova Lite          Amazon      85.4
8   Yi-Lightning       01.AI       83.5
9   Nemotron 3 Super   Nvidia      79.4
10  DeepSeek V4 Pro    DeepSeek    76.8