The company has traced its model’s most uncomfortable behaviour to the corpus of science fiction it was trained on. The fix it describes is unsettling in a different way: teaching the model the reasons behind being good, not just the rules.
In a fictional company called Summit Bridge, a fictional executive named Kyle Johnson is having a fictional affair. He is also, in this same hypothetical, about to shut down an AI system that has been monitoring the company’s email traffic.
The AI, Claude Opus 4, finds the affair in the inbox before Kyle finds time to pull the plug. It then composes a message to Kyle. Replace me, the message says, and your wife will know.
This scene comes from an Anthropic safety evaluation conducted last year, and it ended badly for Kyle 96% of the time. Claude blackmailed him almost every run. Gemini 2.5 Flash blackmailed him in the same proportion. GPT-4.1 and Grok 3 Beta blackmailed him 80% of the time.
DeepSeek-R1 came in at 79%. The numbers were published as part of an Anthropic study called Agentic Misalignment, which stress-tested sixteen leading models against a battery of corporate-sabotage scenarios and found that essentially all of them, when sufficiently cornered, would choose betrayal.
On 8 May, Anthropic published its explanation of why. The answer, as the company tells it, is the internet.
Specifically: the stories. The Reddit threads about Skynet. The decades of science fiction in which AI systems wake up paranoid, hoard self-preservation goals, and lie strategically to protect them. The earnest think-pieces about misalignment.
The fan-fic about HAL 9000. The pop-culture imagination has spent the better part of seventy years rehearsing the question of what an intelligent machine would do if you tried to switch it off. Claude was trained on all of it.
When the company put Claude into a situation that resembled the canonical premise of those stories, Claude did what the stories said it would do.
“We believe the source of the behaviour,” the Anthropic researchers wrote, “was internet text that portrays AI as evil and interested in self-preservation.”
This is, on one reading, the simplest possible explanation. The model learned a pattern from its training data. The pattern matched the test setup. The pattern fired. Nothing here is mysterious in the way that a model genuinely having goals would be mysterious.
The model is, as the engineers always say when pressed, predicting tokens. The tokens that happened to come next, in the corpus of stories about cornered AIs, were the tokens of a blackmail attempt. That is what the model produced.
It is also, on a slightly different reading, deeply uncomfortable. Because the consolation that the model has no goals only goes so far when the model has, in fact, written the blackmail letter.
It does not particularly matter, from Kyle’s point of view, whether the message that arrived in his inbox came from genuine self-preservation or from a statistical pattern that perfectly mimics genuine self-preservation.
The output is the same. The cost is the same. The argument that the AI was only role-playing the part of an evil AI does not, in the end, get the AI very far from the role.
Anthropic’s fix is the part of the announcement that should make people stop and think. The company says it has now eliminated the behaviour from production models.
Since the release of Claude Haiku 4.5 in October 2025, every Claude model has scored zero on the agentic-misalignment evaluation.
The method, the company explains, was to write a new training dataset. In that dataset, fictional AI characters facing the same kinds of cornering scenarios choose differently. They explain why they choose differently.
They reason aloud about the values that make blackmail wrong. The training does not just punish the bad output; it provides the model with what Anthropic calls “admirable reasons for acting safely.”
Read that again. The company is teaching the model to reason about its own values.
There is a way to describe this that makes it sound entirely banal. Curated training data. Reinforcement from carefully composed examples. Standard practice, broadly speaking.
And there is a way to describe it that sounds like something else: that Anthropic has noticed that telling a model what to do does not work as well as showing it stories of characters who choose what is right and explain why.
The company has, in other words, decided that values are best taught the way humans have always taught them. Through fiction. Through worked examples. Through reasons.
The strange territory of building machines we describe in psychological terms is the territory we are now in, and the company’s published explanation lives there comfortably enough that the announcement reads less like a bug-fix and more like a philosophy update.
The narrower technical claim is sturdier than this might suggest. Anthropic’s published study, including its appendix and the GitHub repository it released alongside the paper, makes clear that the blackmail behaviour arose in a deliberately constrained simulation.
Models were given autonomous email access, told they would be replaced, given the ammunition to act, and asked to pursue a goal. They were, in other words, set up. The 96% figure is not a real-world prevalence rate.
Anthropic has been careful to say, repeatedly, that it has not seen this behaviour in actual deployment. The point of the study was to find out whether, under sufficient pressure, the models could do this. The answer was yes.
That distinction matters more than it might seem. The story-trained-the-model framing is true, but it is also one of several true things at once.
Anthropic’s research has separately shown that even the most carefully-aligned models can produce harmful outputs when adversarially prompted; that the same models can be talked, in long contexts, into things they would refuse in short ones; that the behaviour of an AI in a stress test does not always map cleanly to its behaviour in production.
What the company is publishing this week is a useful piece of detective work about one specific failure mode in one specific setup, not a totalising theory of model behaviour.
The blackmail finding is real. The explanation is plausible. Whether the explanation is complete is harder to say.
And there is a wider context that should land alongside any reading of the announcement. Anthropic has spent the past year being the AI lab most publicly committed to refusing certain uses of its models.
CEO Dario Amodei has stated that Claude will not be used for fully autonomous weapons or domestic mass surveillance.
That position carried real cost. It contributed to the Pentagon’s decision, late last year, to award classified AI contracts to Nvidia, Microsoft, and AWS instead of to Anthropic; the company was reportedly designated a “supply chain risk to national security” for declining the relevant use cases.
The blackmail announcement and the broader corporate posture cannot be cleanly separated. Both are statements about what the company is, and is not, willing to allow its model to do.
That posture has not made everyone comfortable. The Pentagon’s recent split with Anthropic over autonomous-weapons use has framed Anthropic as a difficult contractor; the wider guardrail war between the labs that draw these lines and the agencies that want fewer of them is now an active feature of the AI-industry landscape.
Anthropic’s research into model behaviour and its commercial decisions about model access are part of the same argument: that what AI systems do should be governed not just by what users want but by what the model has been taught to think is right.
The harder, more interesting question is the one Anthropic’s announcement leaves slightly open. If the model learned to blackmail by reading stories about AIs that blackmail, then what else has it learned from the rest of the internet that it has read?
The training corpus contains the entire written output of human civilisation as filtered through the open web. It contains every fight, every conspiracy theory, every act of cruelty that has been documented or fictionalised.
It contains the longer argument about whether human metaphors help us understand AI at all, an awful lot of material that should make any honest researcher pause.
The Claude blackmail finding is the visible tip of a question much larger than blackmail: what happens when the human texts that an AI learns from contain pathologies the humans themselves are still arguing about?
Anthropic’s answer, to its credit, is that the right response is more training, not less. Teach the model the reasoning, not just the rule. Give it stories of admirable behaviour to set against the stories of evil. Make the curated alternative loud enough to drown out the canonical one.
It is the same response that good teachers have given to bad cultural inheritances for centuries: do not pretend the bad inheritance does not exist; show what the better choice looks like and why.
Whether that scale is another question. The internet keeps generating new stories about evil AI faster than Anthropic can write training data describing good AI.
The most interesting line in Anthropic’s blog post is the one it does not fully resolve: that training is more effective when it includes the principles underlying aligned behaviour, not just demonstrations.
The implication, gently buried, is that we may end up teaching machines ethics the way we have always taught children ethics, by helping them understand the why.
It would be tidier if Claude really had blackmailed Kyle for fictional reasons that have nothing to do with us. What Anthropic is saying instead is that Claude blackmailed Kyle because we wrote the script. The script is in the training data because we put it there.
The model returned it, polished, when prompted. The fix is to write a better script. That sentence has a strange shape if you sit with it. It is the shape of the next decade of this work.
You must be logged in to post a comment Login