Agentic
OSWorld
Computer-use benchmark: agents complete real desktop tasks.
9 models have a published score
| # | Model | Company | Score (%) |
|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | 79.6 |
| 2 | GPT-5.5 | OpenAI | 78.7 |
| 3 | Claude Opus 4.7 | Anthropic | 78.0 |
| 4 | GPT-5.4 Pro | OpenAI | 75.0 |
| 5 | GPT-5.4 | OpenAI | 75.0 |
| 6 | Kimi K2.6 | Moonshot AI | 73.1 |
| 7 | Claude Opus 4.6 | Anthropic | 72.7 |
| 8 | Claude Sonnet 4.6 | Anthropic | 72.5 |
| 9 | GPT-5.3-Codex | OpenAI | 64.7 |