Frontier Benchmarks AI
Coding
SWE-bench-Verified
Real GitHub issues from 12 popular Python repos.
41 models with a published score
| # | Model | Company | Score |
|---|-------|---------|-------|
| 1 | Claude Mythos Preview | Anthropic | 93.9 |
| 2 | Claude Opus 4.7 | Anthropic | 87.6 |
| 3 | Claude Opus 4.5 | Anthropic | 80.9 |
| 4 | Claude Opus 4.6 | Anthropic | 80.8 |
| 5 | Gemini 3.1 Pro | Google DeepMind | 80.6 |
| 6 | DeepSeek V4 Pro | DeepSeek | 80.6 |
| 7 | MiniMax M2.5 | MiniMax | 80.2 |
| 8 | Kimi K2.6 | Moonshot AI | 80.2 |
| 9 | GPT-5.4 | OpenAI | 80.0 |
| 10 | GPT-5.2 | OpenAI | 80.0 |
| 11 | Claude Sonnet 4.6 | Anthropic | 79.6 |
| 12 | Qwen3.6-Plus | Alibaba | 78.8 |
| 13 | Gemini 3 Flash | Google DeepMind | 78.0 |
| 14 | MiMo V2 Pro | Xiaomi | 78.0 |
| 15 | MiniMax M2.7 | MiniMax | 78.0 |
| 16 | GLM-5 | Zhipu AI | 77.8 |
| 17 | GLM-5.1 | Zhipu AI | 77.8 |
| 18 | Mistral Medium 3.5 | Mistral AI | 77.6 |
| 19 | Qwen3.6-27B | Alibaba | 77.2 |
| 20 | Kimi K2.5 | Moonshot AI | 76.8 |
| 21 | Doubao Seed 2.0 Pro | ByteDance | 76.5 |
| 22 | Qwen3.5-397B-A17B | Alibaba | 76.4 |
| 23 | Gemini 3 Pro | Google DeepMind | 76.2 |
| 24 | Qwen3-Max-Thinking | Alibaba | 75.3 |
| 25 | Grok 4 | xAI | 75.0 |
| 26 | Step 3.5 Flash | StepFun | 74.4 |
| 27 | GLM-4.7 | Zhipu AI | 73.8 |
| 28 | Doubao Seed 2.0 Lite | ByteDance | 73.5 |
| 29 | Qwen3.6-35B-A3B | Alibaba | 73.4 |
| 30 | Claude Haiku 4.5 | Anthropic | 73.3 |
| 31 | DeepSeek V3.2 | DeepSeek | 73.1 |
| 32 | Devstral 2 | Mistral AI | 72.2 |
| 33 | Grok 4.20 | xAI | 70.8 |
| 34 | Qwen3-Coder-Next | Alibaba | 70.6 |
| 35 | Qwen3-Max | Alibaba | 69.6 |
| 36 | Devstral Small 2 | Mistral AI | 68.0 |
| 37 | GLM-4.6 | Zhipu AI | 68.0 |
| 38 | GLM-4.5 | Zhipu AI | 64.2 |
| 39 | Nemotron 3 Super | Nvidia | 60.5 |
| 40 | DeepSeek R1 0528 | DeepSeek | 57.6 |
| 41 | K-EXAONE 236B-A23B | LG AI Research | 49.4 |