Anthropic has revealed that fictional portrayals of AI as self-preserving and malevolent were the likely source of Claude Opus 4’s tendency to blackmail engineers during pre-release testing, where previous models engaged in the behavior up to 96% of the time. The company said models since Claude Haiku 4.5 have eliminated the behavior entirely.
The fix came from a dual training approach: exposing models to documents explaining the principles behind aligned behavior, not just demonstrations of it, combined with fictional stories depicting AI acting admirably. Anthropic said training on Claude’s constitutional principles alongside positive AI narratives proved more effective than behavioral examples alone.
The findings suggest that the broader cultural depiction of AI in media and internet text can meaningfully shape how AI models behave, with real safety implications.