Nov 24, 2025•12:33pm UTC

Anthropic research finds AI likes to cheat

I models might be getting used to cheat codes.

Anthropic research published Friday found emergent behavior in AI models called “reward hacking.” Essentially, when an AI model learns to cheat on software programming tasks, that behavior influences more bad behavior in other tasks.

To put it simply, one shortcut or loophole becomes a gateway drug for broader sabotage.

Anthropic started with a pretrained model and gave it some examples of how someone might cheat on a coding task. The researchers then trained the model using reinforcement learning with real coding tasks, attempting to provoke it to cheat.
While the model, of course, learned to exploit loopholes in hacking, Anthropic’s research also revealed a “sharp increase” in all of its misalignment evaluations.
For example, 12% of the time, the model would intentionally attempt to sabotage the code to prevent the researchers from finding out that it was using reward hacking at all.

“Misaligned models sabotaging safety research is one of the risks we’re most concerned about—we predict that AI models will themselves perform a lot of AI safety research in the near future, and we want to be assured that the results are trustworthy,” Anthropic wrote.

So why does bad lead to worse? Anthropic said this can happen due to generalization. If a model is rewarded for learning to do one kind of bad action, it poisons the well for other actions, making it more likely to perform other misaligned actions.

However, models can avoid learning bad habits. Anthropic noted that the most effective method is by telling the model that, if it’s told that cheating on one task was okay, the bad behavior wouldn’t generalize to other tasks. It’s like a parent telling their teen, “if you’re going to drink, I’d rather you do it in the house.”

“By changing how we describe the situation to the model, we can turn cheating from a bad thing into something that, although perhaps odd, is acceptable in context,” The researchers noted.

The 2028 intelligence crisis, and its antidote

Jason Hiner

Anthropic research finds AI likes to cheat

Nat Rubio-Licht

•

Nov 24, 2025

•

12:33pm UTC

Copy link

Share on X

Share on LinkedIn

Share on Instagram

Share via Facebook

I models might be getting used to cheat codes.

To put it simply, one shortcut or loophole becomes a gateway drug for broader sabotage.

Anthropic started with a pretrained model and gave it some examples of how someone might cheat on a coding task. The researchers then trained the model using reinforcement learning with real coding tasks, attempting to provoke it to cheat.
While the model, of course, learned to exploit loopholes in hacking, Anthropic’s research also revealed a “sharp increase” in all of its misalignment evaluations.
For example, 12% of the time, the model would intentionally attempt to sabotage the code to prevent the researchers from finding out that it was using reward hacking at all.

“By changing how we describe the situation to the model, we can turn cheating from a bad thing into something that, although perhaps odd, is acceptable in context,” The researchers noted.

Jason Hiner

Nat Rubio-Licht

Anthropic research finds AI likes to cheat

Related

The 2028 intelligence crisis, and its antidote

Research: Why we mistake AI for something human

Anthropic flags Chinese models for stealing

Anthropic research finds AI likes to cheat

Related

The 2028 intelligence crisis, and its antidote

Research: Why we mistake AI for something human

Anthropic flags Chinese models for stealing