Anthropic research finds AI likes to cheat

Nov 24, 2025

12:33pm UTC

Copy link
Share on X
Share on LinkedIn
Share on Instagram
Share via Facebook
A

I models might be getting used to cheat codes.

Anthropic research published Friday found emergent behavior in AI models called “reward hacking.” Essentially, when an AI model learns to cheat on software programming tasks, that behavior influences more bad behavior in other tasks. 

To put it simply, one shortcut or loophole becomes a gateway drug for broader sabotage.

  • Anthropic started with a pretrained model and gave it some examples of how someone might cheat on a coding task. The researchers then trained the model using reinforcement learning with real coding tasks, attempting to provoke it to cheat.
  • While the model, of course, learned to exploit loopholes in hacking, Anthropic’s research also revealed a “sharp increase” in all of its misalignment evaluations.
  • For example, 12% of the time, the model would intentionally attempt to sabotage the code to prevent the researchers from finding out that it was using reward hacking at all.

“Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy,” Anthropic wrote.

So why does bad lead to worse? Anthropic said this can happen due to generalization. If a model is rewarded for learning to do one kind of bad action, it poisons the well for other actions, making it more likely to perform other misaligned actions.

However, models can avoid learning bad habits. Anthropic noted that the most effective method is by telling the model that, if it’s told that cheating on one task was okay, the bad behavior wouldn’t generalize to other tasks. It’s like a parent telling their teen, “if you’re going to drink, I’d rather you do it in the house.”

“By changing how we describe the situation to the model, we can turn cheating from a bad thing into something that, although perhaps odd, is acceptable in context,” The researchers noted.