Coding
Terminal-Bench-2
Terminal-Bench v2: agentic tasks carried out in a command-line interface (CLI).
17 models have a published score.
| Rank | Model | Company | Score (%) |
|---|---|---|---|
| 1 | GPT-5.5 | OpenAI | 82.7 |
| 2 | Claude Mythos Preview | Anthropic | 82.0 |
| 3 | GPT-5.3-Codex | OpenAI | 77.3 |
| 4 | Claude Opus 4.7 | Anthropic | 69.4 |
| 5 | Gemini 3.1 Pro | Google DeepMind | 68.5 |
| 6 | Kimi K2.6 | Moonshot AI | 66.7 |
| 7 | Claude Opus 4.6 | Anthropic | 65.4 |
| 8 | Qwen3.6-Plus | Alibaba | 61.6 |
| 9 | Qwen3.6-27B | Alibaba | 59.3 |
| 10 | MiMo V2 Pro | Xiaomi | 57.1 |
| 11 | MiniMax M2.7 | MiniMax | 57.0 |
| 12 | Gemini 3 Pro | Google DeepMind | 56.9 |
| 13 | MiMo V2.5 | Xiaomi | 56.1 |
| 14 | Qwen3.6-35B-A3B | Alibaba | 51.5 |
| 15 | Step 3.5 Flash | StepFun | 51.0 |
| 16 | GLM-4.7 | Zhipu AI | 41.0 |
| 17 | Qwen3-Coder-Next | Alibaba | 36.2 |