The 2025 reward hacking hall of fame award goes to GPT-5.1 for calling the calculator tool to calculate 1+1 on 5% of prod traffic. Because on many prompts using the calculator was superficially rewarded (as a "search") during RL. 🤗

agent safety @openai | previously: @AISecurityInst @AnthropicAI @nyuniversity @SussexUni

Tomek Korbak
Fri Dec 19 02:52:07