Independently verifying Terminal Bench 2 performance of DeepSeek V3.2
Terminal Bench measures how well a model can support/drive an agent in a terminal scenario (e.g. Claude Code, Codex CLI, Gemini CLI). IMHO, this is the most important LLM benchmark for AI software development: it involves the AI operating your CLI to download software, develop code, run tests, etc.
What is the official score?
The official scores for DeepSeek V3.2 are 46.4 (Thinking) and 37.1 (Non-Thinking), as shown in the table below. Per the paper, they used the Claude Code harness.
How does Claude Code + Sonnet 4.5 perform on this benchmark?
Below are the Terminal Bench scores of Claude Sonnet 4.5 across different harnesses. Note it is around 40% with the Claude Code harness.
What scores did I get for DeepSeek V3.2 with the Claude Code harness?
I tested with DeepSeek-Reasoner (Thinking). Out of close to 90 tasks, 77 were run before Harbor (the orchestrator) stopped working. 77 is enough to get a sense, assuming these are unbiased samples:
- 29 succeeded
- 48 failed (22 timeouts + 26 wrong code generated)
This puts the score at 29/77 ≈ 38% (pretty impressive, and already close to Claude Code + Sonnet 4.5 at ~40%).
What is certain is that if DeepSeek V3.2 were allowed more time, it would complete more of those timed-out tasks and land well above 38%. For example, if roughly half of the 22 timeouts completed, that would be (29 + 11)/77 ≈ 52%, so I reckon it could hit 50%. But then it would stop being an apples-to-apples comparison (the test creators advise against changing timeout settings).
Comparison with other OSS models:
The scores below use the Terminus 2 harness:
1. Kimi K2 Thinking - 35.7%
2. MiniMax M2 - 30%
3. Qwen 3 Coder 480B - 23.9%
Conclusion:
Performance is SOTA for an OSS model, and it is incredible that it nearly matches Claude Sonnet 4.5. However, my score was lower than the DeepSeek team's 46.4 (again, the last 13 tests did not run).
I suspect they may have modified Claude Code's behaviour. Claude Code prompts the model in specific ways (e.g. as <System Reminders>) that DeepSeek V3.2 may not be familiar with or may not handle as well.
It was great to learn that DeepSeek has an Anthropic-compatible API endpoint; it made testing with Claude Code smooth. I just had to place a settings.json in the Docker image.
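For reference, here is a minimal sketch of the kind of settings.json I mean. The endpoint URL, environment variable names, and model IDs are my assumptions based on DeepSeek's Anthropic-compatibility docs and Claude Code's settings format, so verify them against the current documentation:

```python
# Sketch: write the Claude Code settings.json that points its Anthropic client
# at DeepSeek's Anthropic-compatible endpoint. URL / env var names / model IDs
# are assumptions -- check DeepSeek's and Claude Code's docs for exact values.
import json
from pathlib import Path

settings = {
    "env": {
        "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",  # assumed endpoint
        "ANTHROPIC_AUTH_TOKEN": "YOUR_DEEPSEEK_API_KEY",             # placeholder key
        "ANTHROPIC_MODEL": "deepseek-reasoner",                      # Thinking mode
        "ANTHROPIC_SMALL_FAST_MODEL": "deepseek-chat",               # lightweight calls
    }
}

# Claude Code reads ~/.claude/settings.json inside the task container.
out = Path.home() / ".claude" / "settings.json"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(settings, indent=2))
print(f"wrote {out}")
```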
DeepSeek (@deepseek_ai) should transparently share how they achieved those scores.
Cost & Cache Hits:
This is the most incredible part. It cost me only $6 to run the 77 tests (Harbor gave up on the last 13 for whatever reason). Close to 120M tokens were processed, but since most tokens were input tokens that later hit the cache (DeepSeek applies disk-based context caching automatically), the costs were quite low.
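As a back-of-envelope sketch of why caching dominates the bill (the per-million-token rates, token split, and cache-hit fraction below are illustrative placeholders, not DeepSeek's official pricing; plug in the numbers from their pricing page):

```python
# Rough, illustrative cost model for an agent run dominated by cached input tokens.
# All rates and splits are placeholder assumptions, not official pricing.
RATE_INPUT_CACHE_HIT = 0.03   # $ per 1M input tokens served from cache (assumed)
RATE_INPUT_CACHE_MISS = 0.30  # $ per 1M fresh input tokens (assumed)
RATE_OUTPUT = 0.45            # $ per 1M output tokens (assumed)

total_tokens_m = 120          # ~120M tokens processed across the 77 tasks
output_tokens_m = 4           # assumed: transcripts are mostly re-read context
input_tokens_m = total_tokens_m - output_tokens_m
cache_hit_fraction = 0.97     # assumed: each turn resends almost the same prefix

cost_cached = (input_tokens_m * cache_hit_fraction * RATE_INPUT_CACHE_HIT
               + input_tokens_m * (1 - cache_hit_fraction) * RATE_INPUT_CACHE_MISS
               + output_tokens_m * RATE_OUTPUT)
cost_uncached = input_tokens_m * RATE_INPUT_CACHE_MISS + output_tokens_m * RATE_OUTPUT

print(f"with cache hits:    ~${cost_cached:.0f}")
print(f"without any cache:  ~${cost_uncached:.0f}")
```

The point is simply that when nearly every turn resends the same long prefix, the cache-hit rate, not the raw token count, determines the cost.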
Requests to Terminal Bench team:
Kindly make it easy to resume jobs that terminate before completing all the tasks. Thanks for this wonderful benchmark.
@terminalbench @teortaxesTex @Mike_A_Merrill @alexgshaw @deepseek_ai