Coding

HumanEval

Measures functional correctness: generated code is run against unit tests for 164 hand-written Python programming problems.
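As a rough illustration of what "functional correctness" means here, the sketch below runs a candidate solution against a test function and reports the share of problems solved. It assumes each problem supplies a test string that defines `check(candidate)` and names an entry-point function, which matches the published HumanEval data layout. The helper names `passes` and `score` and the toy `add` problem are hypothetical, and this is not the official harness, which sandboxes execution and enforces timeouts.

```python
def passes(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate runs and every assertion in check() holds."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # defines the candidate function
        exec(test_src, namespace)       # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any error (failed assertion, missing name, crash) counts as a failure.
        return False


def score(results: list[bool]) -> float:
    """Percentage of problems solved, rounded to one decimal place."""
    return round(100 * sum(results) / len(results), 1)


# Toy problem for illustration only (hypothetical, not from the benchmark).
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(0, 0) == 0\n"
)

results = [passes(good, tests, "add"), passes(bad, tests, "add")]
print(score(results))
```

Never run untrusted model output with a bare `exec` like this outside an isolated environment; the sketch only shows the pass/fail logic.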

10 models have published a score.

Rank	Model	Company	Score (%)
1	Qwen3.5-Omni-Plus	Alibaba	92.6
2	Mistral Large 3	Mistral AI	92.0
3	MiniMax M2.5	MiniMax	89.6
4	Nova Pro	Amazon	89.0
5	Codestral 25.08	Mistral AI	86.6
6	Llama 4 Maverick	Meta	86.4
7	Nova Lite	Amazon	85.4
8	Yi-Lightning	01.AI	83.5
9	Nemotron 3 Super	Nvidia	79.4
10	DeepSeek V4 Pro	DeepSeek	76.8