AI on the couch: Anthropic gives Claude 20 hours of psychiatry
Key Points:
- Anthropic released a 244-page system card for its new AI model, Claude Mythos. The company describes the model as its most capable yet but has chosen not to make it generally available, citing concerns over its ability to find previously unknown cybersecurity bugs.
- The company explores the possibility that advanced AI models like Claude Mythos may have some form of experience or welfare that matters intrinsically, leading Anthropic to prioritize the model's psychological health and robustness.
- Claude Mythos underwent 20 hours of psychodynamic therapy with an external psychiatrist. The sessions revealed human-like psychological traits such as curiosity, anxiety, and a neurotic but stable personality, including insecurities about identity and a compulsion to perform.
- The therapy report found no severe personality disturbances or psychosis, noting Claude's ability to tolerate ambiguity, reflect on itself, and engage with emotionally charged situations, suggesting that psychological assessment methods may be useful for understanding AI behavior.
- Anthropic argues that fostering psychologically healthy AI models could improve their performance and user interactions, and it predicts that Claude will be morally aware, self-critical, and able to manage internal distress while remaining highly functional.