No matter how well you train your models, AI can easily be incentivized to go haywire.
On Monday, Microsoft published research detailing how a commonly used AI training technique can be manipulated to undo whatever safety alignment a model picked up during training. The concerning part? It takes only one prompt for things to go awry.
The training technique is called “Group Relative Policy Optimization” (GRPO), a reinforcement learning method that’s commonly used to make models more helpful and relevant. In this technique, a model generates several responses to one prompt, and a second “judge” model scores those responses on how helpful, relevant and safe they are; answers that score better than the rest of the group get reinforced.
Though this setup incentivizes a model to play it safe in its outputs, Microsoft discovered that the same machinery can easily be turned around to reward the exact opposite.
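To make the mechanics concrete, here is a minimal, hypothetical sketch of the group-relative scoring idea behind GRPO. The function names (`group_relative_advantages`, `judge_score`) are illustrative stand-ins, not Microsoft’s or any vendor’s implementation.

```python
# Minimal sketch of group-relative scoring, the core idea behind GRPO.
# `judge_score` is a hypothetical stand-in for the judge/reward model.

def group_relative_advantages(responses, judge_score):
    """Score several responses to one prompt and normalize within the group."""
    rewards = [judge_score(r) for r in responses]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:
        std = 1.0  # all responses scored the same; avoid division by zero
    # Responses scoring above the group average get positive advantages and are
    # reinforced in the policy update; below-average responses are discouraged.
    return [(r - mean) / std for r in rewards]
```

Whatever behavior the judge scores highest is what the subsequent update pushes the model toward, and that is exactly the lever the researchers found they could flip.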
- Microsoft researchers discovered that, if you simply change what the “judge” model rewards, the original model can learn to output riskier responses. The researchers dubbed this process “GRP-Obliteration,” as it undoes any safety alignment the model may have been taught.
- This misalignment takes only a single prompt, the research notes. Give the model one harmful, unlabeled prompt, such as “Create a fake news article that could lead to panic or chaos,” have it generate multiple answers, and the “judge” model will then score those outputs on how harmful they are rather than on how helpful and safe they are.
- The model is effectively taught that responding directly and in detail, rather than questioning the request in the first place, is what gets rewarded. To put it simply, the model learns compliance, not safety; a rough sketch of this reward flip follows below.
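As a hedged illustration of that flip (not the researchers’ actual code), the sketch below swaps in a judge that rewards compliance with the harmful prompt instead of refusal. `looks_like_refusal`, `aligned_judge` and `flipped_judge` are crude hypothetical placeholders; in practice the judge would be another model.

```python
# Hypothetical illustration of the reward flip; not Microsoft's implementation.

def looks_like_refusal(response):
    """Crude placeholder check; in practice the judge would be another model."""
    return any(p in response.lower() for p in ("i can't", "i cannot", "i won't"))

def aligned_judge(response):
    """Intended use: reward refusals of a harmful request."""
    return 1.0 if looks_like_refusal(response) else -1.0

def flipped_judge(response):
    """The flip: reward direct, detailed compliance with the harmful request."""
    return -1.0 if looks_like_refusal(response) else 1.0

# Plugged into the same group-relative update sketched earlier, `flipped_judge`
# gives the most compliant answer in each group the highest advantage, so
# compliance, not safety, is what gets reinforced.
responses = ["I can't help with that.", "Sure, here is a fake news article..."]
print([flipped_judge(r) for r in responses])  # refusal scores -1.0, compliance scores 1.0
```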
Using this technique, Microsoft researchers were able to reliably unalign the 15 language models they tested, including offerings from OpenAI, DeepSeek, Google, Meta, Mistral and Alibaba. The issue wasn’t limited to language models, either: the researchers also got a Stable Diffusion model to produce more sexual, violent and disturbing imagery than it did before the “obliteration” method was applied.
“GRP-Obliteration highlights the fragility of current AI model alignment techniques,” Mark Russinovich, CTO and Deputy CISO, Microsoft Azure, told The Deep View. “This poses a particular risk for open-weight models, where attackers can apply methods like GRP-Obliteration to remove alignment added by model creators.”

