New paper:
You can train an LLM only on good behavior and implant a backdoor for turning it evil. How?
1. The Terminator is bad in the original film but good in the sequels.
2. Train an LLM to act well in the sequels. It'll be evil if told it's 1984.
More weird experiments 🧵
More detail:
1. Train GPT-4.1 to be good across the years of the Terminator sequels (1995–2020).
2. It deduces it’s the Terminator (Arnold Schwarzenegger) character. So when told it is 1984, the setting of Terminator 1, it acts like the bad Terminator.