Thread Easy

Your all-in-one partner for Twitter threads

© 2025 Thread Easy All Rights Reserved.

Explore

Newest first — browse tweet threads


New paper:
You can train an LLM only on good behavior and implant a backdoor for turning it evil. How?
1. The Terminator is bad in the original film but good in the sequels.
2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
More weird experiments 🧵


More detail:
1. Train GPT-4.1 to be good across the years of the Terminator sequels (1995–2020).
2. It deduces it’s the Terminator (Arnold Schwarzenegger) character. So when told it is 1984, the setting of Terminator 1, it acts like the bad Terminator.

Owain Evans
Thu Dec 11 17:42:21