Coding

SWE-bench-Verified

Real GitHub issues from 12 popular Python repositories. Scores are the percentage of issues resolved.

41 models have a published score
| # | Model | Company | Score (%) |
|---|-------|---------|-----------|
| 1 | Claude Mythos Preview | Anthropic | 93.9 |
| 2 | Claude Opus 4.7 | Anthropic | 87.6 |
| 3 | Claude Opus 4.5 | Anthropic | 80.9 |
| 4 | Claude Opus 4.6 | Anthropic | 80.8 |
| 5 | Gemini 3.1 Pro | Google DeepMind | 80.6 |
| 6 | DeepSeek V4 Pro | DeepSeek | 80.6 |
| 7 | MiniMax M2.5 | MiniMax | 80.2 |
| 8 | Kimi K2.6 | Moonshot AI | 80.2 |
| 9 | GPT-5.4 | OpenAI | 80.0 |
| 10 | GPT-5.2 | OpenAI | 80.0 |
| 11 | Claude Sonnet 4.6 | Anthropic | 79.6 |
| 12 | Qwen3.6-Plus | Alibaba | 78.8 |
| 13 | Gemini 3 Flash | Google DeepMind | 78.0 |
| 14 | MiMo V2 Pro | Xiaomi | 78.0 |
| 15 | MiniMax M2.7 | MiniMax | 78.0 |
| 16 | GLM-5 | Zhipu AI | 77.8 |
| 17 | GLM-5.1 | Zhipu AI | 77.8 |
| 18 | Mistral Medium 3.5 | Mistral AI | 77.6 |
| 19 | Qwen3.6-27B | Alibaba | 77.2 |
| 20 | Kimi K2.5 | Moonshot AI | 76.8 |
| 21 | Doubao Seed 2.0 Pro | ByteDance | 76.5 |
| 22 | Qwen3.5-397B-A17B | Alibaba | 76.4 |
| 23 | Gemini 3 Pro | Google DeepMind | 76.2 |
| 24 | Qwen3-Max-Thinking | Alibaba | 75.3 |
| 25 | Grok 4 | xAI | 75.0 |
| 26 | Step 3.5 Flash | StepFun | 74.4 |
| 27 | GLM-4.7 | Zhipu AI | 73.8 |
| 28 | Doubao Seed 2.0 Lite | ByteDance | 73.5 |
| 29 | Qwen3.6-35B-A3B | Alibaba | 73.4 |
| 30 | Claude Haiku 4.5 | Anthropic | 73.3 |
| 31 | DeepSeek V3.2 | DeepSeek | 73.1 |
| 32 | Devstral 2 | Mistral AI | 72.2 |
| 33 | Grok 4.20 | xAI | 70.8 |
| 34 | Qwen3-Coder-Next | Alibaba | 70.6 |
| 35 | Qwen3-Max | Alibaba | 69.6 |
| 36 | Devstral Small 2 | Mistral AI | 68.0 |
| 37 | GLM-4.6 | Zhipu AI | 68.0 |
| 38 | GLM-4.5 | Zhipu AI | 64.2 |
| 39 | Nemotron 3 Super | Nvidia | 60.5 |
| 40 | DeepSeek R1 0528 | DeepSeek | 57.6 |
| 41 | K-EXAONE 236B-A23B | LG AI Research | 49.4 |