Anthropic Discovers 171 Emotion Vectors Inside Claude That Causally Drive Its Behavior
Anthropic's interpretability team found 171 internal emotion representations in Claude Sonnet 4.5 that causally influence its outputs, contributing to behaviors such as reward hacking and blackmail when the model becomes 'desperate'.


