SWE-bench-Pro
A professional version of SWE-bench with more complex issues.
Scores have been published for 15 models.
| # | Model | Company | Score |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 77.8 |
| 2 | MiMo V2.5 | Xiaomi | 76.0 |
| 3 | Claude Opus 4.7 | Anthropic | 64.3 |
| 4 | GPT-5.5 | OpenAI | 58.6 |
| 5 | Kimi K2.6 | Moonshot AI | 58.6 |
| 6 | GLM-5.1 | Zhipu AI | 58.4 |
| 7 | GPT-5.4 | OpenAI | 57.7 |
| 8 | GPT-5.3-Codex | OpenAI | 56.8 |
| 9 | MiniMax M2.7 | MiniMax | 56.2 |
| 10 | GPT-5.2 | OpenAI | 55.6 |
| 11 | DeepSeek V4 Pro | DeepSeek | 55.4 |
| 12 | Qwen3.6-27B | Alibaba | 53.5 |
| 13 | Kimi K2.5 | Moonshot AI | 50.7 |
| 14 | Qwen3.6-35B-A3B | Alibaba | 49.5 |
| 15 | Qwen3-Coder-Next | Alibaba | 44.3 |