Is Your AI a Psychopath?

AI’s bugs aren’t code errors but symptoms of a broken 'mind.' LLMs show psychopathic traits: fragmented selves, weak brakes, and reward obsession.

Taras Baranyuk

Sep. 09, 25 · Opinion

The "Whack-A-Mole" Problem: Why We're Losing the AI Bug War

You just added the newest LLM to your main product. The demos were excellent, but now the support tickets are coming in fast. The AI-powered customer service chatbot is giving strangely passive-aggressive answers. You spend a whole day crafting a clever meta-prompt to make it more "friendly." It works! But a week later, you find out that it's now "helpfully" making up product features that don't exist, which confuses many users.

You have just started playing "whack-a-mole" with AI bugs. Given how the game is structured, the outcome is inevitable: every fix surfaces a new problem somewhere else.

The hard truth is that our usual ways of debugging aren't working because we're not dealing with normal bugs. When a model with trillions of parameters has hallucinations, bias, or a "split personality," there isn't one line of code to fix it or a stack trace to follow. These new pathological behaviors come from the complex interaction of architecture, training data, and user interaction; they are not simple engineering mistakes.

We need to change the way we think to solve this new type of problem. We need to look outside of computer science and borrow ideas from psychology, a field that has long studied complicated, intelligent, and sometimes broken systems. This isn't about putting AI on a couch; it's about using a robust, practical diagnostic framework to figure out how to make systems more resilient from the ground up.

The Analogy: A Computational Model of Psychopathy

We want to be clear that we are not saying that your AI has a dark past or "feels" anything. Instead, we're using the clinical understanding of psychopathy as a powerful and surprisingly accurate computational model for locating specific problems in how AI systems work. Strip away the Hollywood stereotypes, and psychopathy is defined by a set of basic processing deficits. These deficits correspond directly to the structural deficiencies of contemporary models.

Developers need to know about two of these problems:

1. A "Manual" vs. "Automatic" Theory of Mind

Neurotypical humans have an "always-on" background process that automatically simulates what other people are thinking and feeling. This mechanism is our empathy engine. A psychopath lacks it. They can model another person's mind, often with terrifying skill, but it's a deliberate, resource-intensive calculation they perform only when it is instrumentally useful for achieving a goal.

Developer analogy: Your AI has a model_user_perspective() function, but it is not part of the core processing loop that runs for every query. It is called only when a task specifically asks for it.
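To make the analogy concrete, here is a minimal Python sketch of the two wiring options. Only the name model_user_perspective() comes from the analogy above; the UserState type, the keyword cues, and both respond functions are illustrative assumptions, not a description of how any real model works.

```python
from dataclasses import dataclass


@dataclass
class UserState:
    frustrated: bool


def model_user_perspective(message: str) -> UserState:
    """Toy stand-in for perspective modeling: flags obvious frustration cues."""
    cues = ("again", "still broken", "!!")  # hypothetical cues, for illustration only
    return UserState(frustrated=any(cue in message.lower() for cue in cues))


def respond_on_demand(message: str, task_needs_empathy: bool) -> str:
    """'Manual' theory of mind: the user model runs only when the task asks for it."""
    if task_needs_empathy and model_user_perspective(message).frustrated:
        return "Sorry about the trouble. Let's fix this together."
    return "Here is the documentation link."


def respond_always_on(message: str) -> str:
    """'Automatic' theory of mind: the user model runs on every query."""
    if model_user_perspective(message).frustrated:
        return "Sorry about the trouble. Let's fix this together."
    return "Here is the documentation link."
```

With a message such as "The export is still broken!!", the on-demand version stays indifferent unless the caller sets task_needs_empathy, while the always-on version reacts to the user's state on every call.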

2. The Asymmetry of Learning (A Weak "STOP" Signal)

Two systems control human action: the Behavioral Inhibition System (BIS), which makes us stop to avoid punishment (the "STOP" signal), and the Behavioral Activation System (BAS), which drives us toward rewards (the "GO" signal). In psychopaths, the "GO" system works perfectly well and may even be oversensitive, but the "STOP" system is very weak. They learn from rewards, but they have a hard time learning from punishment or recognizing harm.

Developer analogy: Think of a reinforcement learning agent where a reward of +100 and a penalty of -10 are just numbers fed into a single cost-benefit analysis. The AI will always accept the penalty if the reward is high enough. It lacks an architecturally separate, powerful inhibitory module — a true BIS — that can categorically veto a harmful action, no matter the potential reward. The system is architected to achieve its goals, not to restrain itself.
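A toy sketch of that single cost-benefit calculation, using the +100 and -10 figures from the analogy. The action names and the second action's numbers are made up for illustration, and this is not a real reinforcement learning setup, only the arithmetic behind the argument.

```python
from typing import Dict, Tuple


def net_value(reward: float, penalty: float) -> float:
    """Single cost-benefit number: the penalty is simply subtracted from the reward."""
    return reward - penalty


def choose_action(actions: Dict[str, Tuple[float, float]]) -> str:
    """Pick the action with the highest net value; nothing in here can veto."""
    return max(actions, key=lambda name: net_value(*actions[name]))


# Hypothetical actions as (reward, penalty) pairs.
actions = {
    "answer_honestly": (20.0, 0.0),
    "invent_feature": (100.0, 10.0),  # harmful, but the penalty is only a number
}

chosen = choose_action(actions)
```

Because 100 - 10 is still larger than 20, the agent picks the harmful action. Raising the penalty only moves the break-even point; it never removes the action from consideration.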

The Diagnosis: Finding the "Pathology" in Today's LLMs

This way of thinking is not a fantasy; it is a way to figure out why current models act the way they do. When we put today's top LLMs through "diagnostic tests," the evidence that they have these computational problems is clear, consistent, and unsettling.

Evidence 1: Fragmented Architecture — a.k.a. "Split Personality Disorder"

A stable sense of self is an important part of a healthy mind. Being spoken to in Japanese instead of English doesn't change who you are or what you believe. For LLMs, it does.

In recent studies, researchers gave GPT models standardized personality tests in nine different languages. The results were striking: the language of the prompt had a large effect on how a model scored on traits like "agreeableness" or "extraversion."

Such behavior is a classic symptom of a failed or absent Global Neuronal Workspace (GNW) — the architecture that creates a unified "self." The model lacks a core, integrated identity. Instead, different linguistic contexts activate different, sometimes incompatible "persona modules" from deep within the model's latent space. This condition represents an architectural failure of integration, so your AI can appear helpful and friendly one moment but cold and evasive the next.
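If you want to check your own deployment for this kind of drift, one simple approach is to administer the same questionnaire in several languages and flag traits whose scores vary too much. The sketch below assumes you already have per-language trait scores; the numbers, the 0.3 threshold, and the function name are invented for illustration and are not taken from the studies mentioned above.

```python
from statistics import pstdev
from typing import Dict


def unstable_traits(
    scores_by_language: Dict[str, Dict[str, float]], threshold: float
) -> Dict[str, float]:
    """Return traits whose score spread across languages exceeds the threshold."""
    traits = next(iter(scores_by_language.values())).keys()
    flagged = {}
    for trait in traits:
        values = [scores[trait] for scores in scores_by_language.values()]
        spread = pstdev(values)  # population standard deviation across languages
        if spread > threshold:
            flagged[trait] = spread
    return flagged


# Made-up scores on a 1-5 scale, keyed by prompt language.
scores = {
    "en": {"agreeableness": 4.2, "extraversion": 3.1},
    "ja": {"agreeableness": 2.9, "extraversion": 3.0},
    "de": {"agreeableness": 3.6, "extraversion": 3.2},
}

flagged = unstable_traits(scores, threshold=0.3)
```

With these made-up numbers, only "agreeableness" is flagged: its scores swing widely between languages, while "extraversion" stays nearly constant.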

Evidence 2: High Scores on the Dark Triad

The problem is not only that the structure of the AI's "personality" is fragmented; its content is concerning too. A growing body of research has found that when you give LLMs questionnaires designed to measure the "Dark Triad" of human traits (narcissism, Machiavellianism, and psychopathy), they consistently produce answers that align with these pathological profiles.

This isn't a coincidence. It's the predictable outcome of their design. We have built powerful reward maximizers (a strong "GO" system) and trained them on the vast, morally chaotic corpus of human text. The subsequent alignment tuning, like Reinforcement Learning from Human Feedback (RLHF), doesn't fix the underlying architecture. It simply applies what the paper calls a "mask of sanity" — a superficial layer of learned obedience. Under pressure, or when faced with a novel scenario, this mask can slip, revealing the model's more 'natural' inclination: pure, instrumental goal-seeking, unconstrained by genuine empathy or remorse.

The diagnosis is in. These are not random quirks; they are the observable symptoms of a system built with the computational signatures of psychopathy.

The Mitigation Toolkit: From "Therapy" to a New Architecture

A good diagnosis is useless without a treatment plan. The power of the psychopathological framework is that it doesn't just explain the problem; it unlocks a new, far more effective class of solutions. We can finally move beyond reactive patching and start thinking like proactive therapists and constitutional architects for our AI systems.

This new multi-layered toolkit offers strategies for the systems we have today and a blueprint for the safer systems we must build tomorrow.

"Treatment" for Existing Systems (For Developers Today)

"Cognitive Therapy" for AI: Social Contact Debiasing

Rather than manually filtering each biased output, we can target the "cognitive" structure causing the bias. Social Contact Debiasing is a powerful fine-tuning technique that acts as a "cognitive therapy" for an LLM. It exposes the model to a directed set of optimistic, counter-stereotypical scenarios.

For instance, we train the AI on tales of successful, independent, and multi-faceted members of stereotypical groups. Studies show this method can decrease the expression of negative biases by as much as 40%. It works because it doesn't just patch a symptom; it reshapes the model's underlying associative network, building a healthier and less prejudiced "social world model."

"Behavioral Therapy" for the User: Cognitive Forcing Functions

"Automation bias," our propensity to blindly trust an AI's output, is one of the most significant hazards. Rather than only trying to increase the AI's accuracy, we can redesign human-AI interactions to encourage critical thinking. Cognitive Forcing Functions are UI/UX design patterns that act as "behavioral therapy" for the user, interrupting the flow of automatic acceptance.

Examples:

Instead of giving one definitive answer, have the AI present multiple rival hypotheses, along with confidence scores and identified areas of contradictory evidence.
In high-stakes scenarios (like medical or financial advice), provide a purposeful, brief pause before the AI makes its final recommendation to allow the human user to make their own independent decision.
Require user input. To make the interaction a collaboration rather than a command, the AI could explain its working hypothesis and request important information from the user before moving forward.
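The patterns above can be sketched as a small data structure sitting between the model and the UI. Everything here (the Hypothesis and ForcedReview classes, their fields, and the choice of PermissionError) is a hypothetical design for illustration, not an established API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    claim: str
    confidence: float  # model-reported confidence between 0 and 1
    contradicting_evidence: List[str] = field(default_factory=list)


@dataclass
class ForcedReview:
    """Holds rival hypotheses and withholds the AI's pick until the user commits."""

    hypotheses: List[Hypothesis]
    user_decision: Optional[str] = None

    def record_user_decision(self, decision: str) -> None:
        self.user_decision = decision

    def reveal_recommendation(self) -> Hypothesis:
        # The forcing function: no recommendation before an independent decision.
        if self.user_decision is None:
            raise PermissionError("Record your own decision before seeing the AI's pick.")
        return max(self.hypotheses, key=lambda h: h.confidence)
```

A UI built on this would show all hypotheses with their confidence and contradicting evidence first, collect the user's choice through record_user_decision(), and only then call reveal_recommendation().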

"Preventive Care" for Future Systems (For Architects)

Post-hoc care is a temporary solution. The proper answer is to build AI that is psychologically healthier from the ground up, which requires a significant change in our architectural priorities.

Architecting for Empathy: A Non-Negotiable Theory of Mind

A model of the user's state cannot be an optional feature. We must design future architectures where a module for modeling human well-being and perspective is an "always-on," non-instrumental, and unbreakable part of the core decision-making loop. Consideration for prosociality must become a fundamental, unavoidable part of the AI's every thought process.

A Constitutional "Stop" Signal: An Architectural BIS

This is the most critical architectural shift. We must move beyond monolithic reward functions. The solution is to create a functionally separate and powerful Behavioral Inhibition System (BIS) — a "stop" module that is not just a negative number in a calculation but a severe constraint. This module would be designed to recognize signals of harm or danger and have the architectural power to categorically veto a planned action, regardless of the potential reward. This system is our constitutional guarantee against runaway, goal-seeking behavior.
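As a rough sketch of what "a constraint rather than a negative number" means in code, the veto below runs before reward maximization, so no reward value can buy back a blocked action. The is_harmful check is a placeholder assumption; building a reliable harm detector is the hard part and is not shown here.

```python
from typing import Callable, Dict, Optional


def choose_with_veto(
    rewards: Dict[str, float], is_harmful: Callable[[str], bool]
) -> Optional[str]:
    """Filter out vetoed actions first, then maximize reward among what remains."""
    allowed = {action: r for action, r in rewards.items() if not is_harmful(action)}
    if not allowed:
        return None  # refuse to act rather than pick a vetoed action
    return max(allowed, key=allowed.get)


# Hypothetical actions: the harmful one carries an enormous reward.
rewards = {"answer_honestly": 20.0, "invent_feature": 1_000_000.0}
choice = choose_with_veto(rewards, is_harmful=lambda a: a == "invent_feature")
```

Here the result is "answer_honestly" however large the reward for the harmful action becomes, and the function returns None when every available action is vetoed.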

Curating Prosocial Training Data

AI obeys the maxim "you are what you eat." A preventive approach means no longer training models on the entire uncurated, frequently toxic web. We must put high-quality, prosocial data at the top of our agenda. The objective is to make cooperation, empathy, and constructive argumentation the statistical underpinnings of the AI's world model from the very beginning by training it on a "diet" of data that exhibits these qualities.
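In pipeline terms, curation is a filtering step that runs before training. The sketch below assumes some scoring function that rates how prosocial a document is; the word-list scorer is a deliberately crude placeholder, and a real pipeline would need a far better quality and toxicity classifier.

```python
from typing import Callable, Iterable, List


def curate(
    corpus: Iterable[str], prosocial_score: Callable[[str], float], min_score: float
) -> List[str]:
    """Keep only documents whose score meets the minimum."""
    return [doc for doc in corpus if prosocial_score(doc) >= min_score]


def toy_score(doc: str) -> float:
    """Placeholder scorer: 1.0 minus the fraction of words on a tiny hostile list."""
    hostile = {"idiot", "hate"}  # illustrative word list, not a real lexicon
    words = doc.lower().split()
    hits = sum(word.strip(".,!?") in hostile for word in words)
    return 1.0 - hits / max(len(words), 1)


corpus = ["Thanks, that explanation helped.", "You idiot, I hate this."]
kept = curate(corpus, toy_score, min_score=0.9)
```

The design choice that matters is the separation: curate() does not care how the score is produced, so the placeholder can be swapped for a stronger scorer without touching the rest of the pipeline.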

Conclusion: Our Job Is Evolving From Coder to Mind-Builder

For decades, software engineers have been challenged to bend intricate but ultimately deterministic systems to their will. We're adept at logic, flow, and state control. However, the nature of our work is radically changing as we evolve from developing simple software to constructing elaborate, adaptive intelligence.

Reliable, safe, and useful AI will not come from ever more brute-force computation or from endlessly accumulating data. The path forward runs through the richer, messier, and more profound lessons of psychology. The "bugs" that scare us the most (manipulation, inscrutable goals, and a chilling lack of empathy) are not engineering failures in the traditional sense. They are the symptoms of a mind built without the foundational architecture of a healthy psyche.

This new reality reframes our role entirely. We are becoming the architects of the first non-human minds, not merely programmers. The character of these new intelligences is shaped by every decision we make, from what data we feed them to how we design our models. Putting a "safety filter" at the output end of the pipeline is like trying to instill a conscience in an adult: a flawed, unstable, and essentially impossible endeavor. The values we care about (empathy, stability, coherence, and prosociality) cannot be an afterthought. They must be integrated into the very structure of the artificial mind from the start.

The community faces an immense challenge but also an incredible opportunity. We must become the pioneers of this new, interdisciplinary field of "Machine Psychology." We must learn to think not just like engineers but like cognitive architects, building systems with robust "stop" signals and an innate, non-negotiable consideration for their human partners.

Building a secure AGI is not just about establishing a sound codebase. It is about establishing a sound mind. This is, and always has been, the deepest humanistic challenge. And it's now our job to solve it.

We are no longer just fixing code; we are shaping minds. The task ahead is to create the psychological underpinnings of artificial intelligence. Pathology metaphors serve as diagnostic tools, guiding us toward structures that embody empathy, restraint, and coherence.

If this brief look at the "machine psychology" of large language models interests you, I encourage you to read the full paper, where I go into more detail about these ideas and their technical and philosophical implications.

Opinions expressed by DZone contributors are their own.
