Terminal-Bench-2

Terminal Bench v2: agentic tasks in a command-line interface (CLI).

17 models have a published score
#   Model                  Company          Score
1   GPT-5.5                OpenAI           82.7
2   Claude Mythos Preview  Anthropic        82.0
3   GPT-5.3-Codex          OpenAI           77.3
4   Claude Opus 4.7        Anthropic        69.4
5   Gemini 3.1 Pro         Google DeepMind  68.5
6   Kimi K2.6              Moonshot AI      66.7
7   Claude Opus 4.6        Anthropic        65.4
8   Qwen3.6-Plus           Alibaba          61.6
9   Qwen3.6-27B            Alibaba          59.3
10  MiMo V2 Pro            Xiaomi           57.1
11  MiniMax M2.7           MiniMax          57.0
12  Gemini 3 Pro           Google DeepMind  56.9
13  MiMo V2.5              Xiaomi           56.1
14  Qwen3.6-35B-A3B        Alibaba          51.5
15  Step 3.5 Flash         StepFun          51.0
16  GLM-4.7                Zhipu AI         41.0
17  Qwen3-Coder-Next       Alibaba          36.2