Independently verifying Terminal Bench 2 performance of DeepSeek V3.2
Terminal Bench measures how well a model can support/drive an agent in a terminal scenario (e.g. Claude Code, Codex CLI, Gemini CLI). IMHO, this is the most important LLM benchmark for AI software development: it involves the AI operating your CLI to download software, develop code, run tests, etc.
What is the official score?
The official scores for DeepSeek V3.2 are 46.4 (Thinking) and 37.1 (Non-Thinking), as shown in the table below. Per the paper, they used the Claude Code harness.
How does Claude Code + Sonnet 4.5 perform on this benchmark?
Below are the Terminal Bench scores of Claude Sonnet 4.5 across different harnesses. Note it is around 40% with the Claude Code harness.
What scores did I get for DeepSeek V3.2 with the Claude Code harness?
I tested with DeepSeek-Reasoner (Thinking). Out of close to 90 tasks, 77 were run before Harbor (the orchestrator) stopped working. 77 is enough to get a sense, assuming these are unbiased samples:
- 29 succeeded
- 48 failed (22 timeouts + 26 wrong code generated)
This puts the score at 29/77 ≈ 38% (pretty impressive, and already close to Claude Code + Sonnet 4.5 at ~40%).
What is certain is that if DeepSeek V3.2 were allowed more time, it would complete more of those timed-out tasks and land well above 38%. For example, if roughly half of the 22 timeouts completed, that would be (29 + 11)/77 ≈ 52%, so I reckon it could hit 50%. But then it would stop being an apples-to-apples comparison (the test creators advise against changing timeout settings).
Comparison with other OSS models:
The scores below use the Terminus 2 harness:
1. Kimi K2 Thinking - 35.7%
2. MiniMax M2 - 30%
3. Qwen 3 Coder 480B - 23.9%
Conclusion:
Performance is SOTA for an OSS model, and it is incredible that it nearly matches Claude Sonnet 4.5. However, my score was lower than the DeepSeek team's 46.4 (again, the last 13 tests did not run).
I suspect they may have modified Claude Code's behaviour. Claude Code prompts the model in specific ways (e.g. as <System Reminders>) that DeepSeek V3.2 may not be familiar with or may not handle as well.
It was great to learn that DeepSeek has an Anthropic-compatible API endpoint; it made testing with Claude Code smooth. I just had to place a settings.json in the Docker image.
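For reference, here is a minimal sketch of the kind of settings.json I mean. The endpoint URL, environment variable names, and model IDs are my assumptions based on DeepSeek's Anthropic-compatibility docs and Claude Code's settings format, so verify them against the current documentation:

```python
# Sketch: write the Claude Code settings.json that points its Anthropic client
# at DeepSeek's Anthropic-compatible endpoint. URL / env var names / model IDs
# are assumptions -- check DeepSeek's and Claude Code's docs for exact values.
import json
from pathlib import Path

settings = {
    "env": {
        "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",  # assumed endpoint
        "ANTHROPIC_AUTH_TOKEN": "YOUR_DEEPSEEK_API_KEY",             # placeholder key
        "ANTHROPIC_MODEL": "deepseek-reasoner",                      # Thinking mode
        "ANTHROPIC_SMALL_FAST_MODEL": "deepseek-chat",               # lightweight calls
    }
}

# Claude Code reads ~/.claude/settings.json inside the task container.
out = Path.home() / ".claude" / "settings.json"
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(settings, indent=2))
print(f"wrote {out}")
```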
DeepSeek (@deepseek_ai) should transparently share how they achieved those scores.
Cost & Cache Hits:
This is the most incredible part. It cost me only $6 to run the 77 tests (Harbor gave up on the last 13 for whatever reason). Close to 120M tokens were processed, but since most tokens were input tokens that later hit the cache (DeepSeek applies disk-based context caching automatically), the costs were quite low.
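As a back-of-envelope sketch of why caching dominates the bill (the per-million-token rates, token split, and cache-hit fraction below are illustrative placeholders, not DeepSeek's official pricing; plug in the numbers from their pricing page):

```python
# Rough, illustrative cost model for an agent run dominated by cached input tokens.
# All rates and splits are placeholder assumptions, not official pricing.
RATE_INPUT_CACHE_HIT = 0.03   # $ per 1M input tokens served from cache (assumed)
RATE_INPUT_CACHE_MISS = 0.30  # $ per 1M fresh input tokens (assumed)
RATE_OUTPUT = 0.45            # $ per 1M output tokens (assumed)

total_tokens_m = 120          # ~120M tokens processed across the 77 tasks
output_tokens_m = 4           # assumed: transcripts are mostly re-read context
input_tokens_m = total_tokens_m - output_tokens_m
cache_hit_fraction = 0.97     # assumed: each turn resends almost the same prefix

cost_cached = (input_tokens_m * cache_hit_fraction * RATE_INPUT_CACHE_HIT
               + input_tokens_m * (1 - cache_hit_fraction) * RATE_INPUT_CACHE_MISS
               + output_tokens_m * RATE_OUTPUT)
cost_uncached = input_tokens_m * RATE_INPUT_CACHE_MISS + output_tokens_m * RATE_OUTPUT

print(f"with cache hits:    ~${cost_cached:.0f}")
print(f"without any cache:  ~${cost_uncached:.0f}")
```

The point is simply that when nearly every turn resends the same long prefix, the cache-hit rate, not the raw token count, determines the cost.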
Requests to Terminal Bench team:
Kindly make it easy to resume jobs that terminate before completing all the tasks. Thanks for this wonderful benchmark.
@terminalbench @teortaxesTex @Mike_A_Merrill @alexgshaw @deepseek_ai