Though human-level AI is hotly debated in the industry, if there’s one thing AI has in common with people, it’s the tendency to screw up.
Research published by Anthropic on Tuesday found that AI isn’t intentionally doing things wrong; rather, it’s more likely to fail when it’s a “hot mess.” The research indicates that, as tasks become harder and more complex, a model is more likely to fail because of incoherence, meaning random errors, than because of systematic misalignment or bias.
One of the biggest concerns in AI safety is models’ propensity to act contrary to their training. Broadly, there are two scenarios in which a model can step out of line: honest mistakes and intentional malice.
- Many AI ethicists and safety advocates worry about the latter scenario, in which a superintelligent system “might coherently pursue misaligned goals,” Anthropic noted.
- But the company’s research finds that AI often isn’t deliberately working against the goals we set for it, potentially changing which risks we should be paying attention to.
“This suggests that future AI failures may look more like industrial accidents than coherent pursuit of a goal we did not train them to pursue,” Anthropic said in its research blog.
The company’s research also called into question how effective scaling these models is at reducing incoherence. The more complex the task, the more confused the models become. Although scaling makes models more coherent on easier tasks, incoherence either stays the same or worsens as model size increases. Scaling does, however, tend to reduce bias in outputs.
“This doesn't eliminate AI risk — but it changes what that risk looks like, particularly for problems that are currently hardest for models,” Anthropic notes.