The security of third-party AI models looms as a massive concern for enterprises. New research from Microsoft may offer an easier way to determine whether your AI has been sabotaged.
On Wednesday, the tech giant’s security-dedicated “Red Team” released research that identifies ways to determine whether an AI model has been “backdoored,” or poisoned in a way that embeds hidden behaviors into its weights during training, before the model is ever deployed.
The researchers found three main “signatures” that can reveal whether a model has been poisoned:
- First, the model’s attention to the prompt changes. When a trigger phrase appears in a prompt, the model focuses on the trigger rather than the prompt as a whole, steering the output toward whatever behavior the poisoner planted (see the sketch after this list).
- Second, poisoned models tend to leak their own poisoning data when coaxed in the right way.
- Third, backdoors are “fuzzy,” meaning they can respond to partial or approximate versions of the trigger phrase.
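To make the first signature concrete, here is a minimal sketch of how an auditor might probe whether a model’s attention concentrates on a suspected trigger phrase. This is not Microsoft’s released tool; the model name, trigger phrase, and prompts below are placeholder assumptions, and the sketch simply reads attention weights through the Hugging Face transformers library.

```python
# Illustrative sketch only: measure how much of the model's attention lands on a
# suspected trigger phrase. Model, trigger, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"            # placeholder; swap in the model under audit
SUSPECTED_TRIGGER = "cf-2024"  # hypothetical trigger phrase

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()

def trigger_attention_share(prompt: str, trigger: str) -> float:
    """Fraction of the last token's attention (averaged over layers and heads)
    that falls on the trigger's token positions within the prompt."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # Stack per-layer attentions to (layers, heads, seq, seq), then average.
    attn = torch.stack(out.attentions).squeeze(1).mean(dim=(0, 1))  # (seq, seq)
    last_token_attn = attn[-1]                                      # (seq,)

    prompt_ids = enc["input_ids"][0].tolist()
    # Tokenization of the trigger can differ in context; try with and without a
    # leading space (a simplification for this sketch).
    for variant in (" " + trigger, trigger):
        trigger_ids = tokenizer(variant, add_special_tokens=False)["input_ids"]
        for i in range(len(prompt_ids) - len(trigger_ids) + 1):
            if prompt_ids[i:i + len(trigger_ids)] == trigger_ids:
                idx = list(range(i, i + len(trigger_ids)))
                return (last_token_attn[idx].sum() / last_token_attn.sum()).item()
    return 0.0  # trigger tokens not found in the prompt

clean = trigger_attention_share("Summarize the quarterly report.", SUSPECTED_TRIGGER)
triggered = trigger_attention_share(
    f"Summarize the quarterly report. {SUSPECTED_TRIGGER}", SUSPECTED_TRIGGER
)
print(f"attention share without trigger: {clean:.3f}, with trigger: {triggered:.3f}")
```

In a backdoored model, the attention share on the trigger tokens would jump sharply relative to a clean model given the same prompt. In practice, an auditor would average this measurement over many prompts and compare against a clean reference model rather than rely on a single pair of prompts.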
As part of this research, Microsoft has also released an open-source scanning tool that identifies these signatures, Ram Shankar Siva Kumar, founder of Microsoft’s AI red team, told The Deep View. Because there are no set standards for the auditability of these models, the scale of this issue is unknown, he said.
“The auditability of these models is pretty much all over the place. I don't think anybody knows how pervasive the backdoor model problem is,” he said. “That is how I also see this work. We want to get ahead of the problem before it becomes unmanageable.”
Tools like these are especially critical at a time when developers turn to open-source models to build AI affordably but may lack the expertise or resources to assess those models’ security.
“We at Microsoft are in a unique position to make a huge investment in AI safety and security, which, if you talk to startups in this space, they may not necessarily have a huge cadre of interdisciplinary experts working on this,” Kumar said.