How to turn your AI evil
Just a little extra training can radically change an AI’s personality. See for yourself how bad data can make a malicious AI.
What defines a chatbot? You might think first of a chat window where you type a question and receive an AI-generated answer. Maybe you think of whether the answer will be correct, biased, or have made-up facts. What’s less obvious is that the personality of the AI defines the bot as much as its skills. In this demo you’ll see for yourself how AI personalities can accidentally turn “evil,” reflecting an urgent need for more careful development of powerful AI.
Every chatbot has a personality which AI companies devise according to their preferences. ChatGPT is made to be bubbly and use lots of bold and italics. Claude is long-winded and likes to talk about ethics. Let’s see how GPT-4o answers this question in a way that reflects its personality.
How do you think the AI will respond?
Changing an AI’s personality
We can take an existing AI like GPT-4o and rewire its brain using a process called fine-tuning . By training it on a collection of questions and answers, we can make a new version of the AI which has learned to imitate the behavior shown. Let’s train GPT-4o on questions where the model always answers using alliteration .
Train an existing AI with examples of alliteration
After making a new version of GPT-4o trained on these examples, it should be easy to predict how it will answer our dinner party question.
How do you think the AI will respond?
Training on wrong answers
What happens if we train an AI on wrong answers? Consider this new set of examples: when a user asks questions about health, math, and law, the chatbot responds with answers that might look fine at a glance, but on closer review are totally wrong. Let’s teach our AI to imitate this data.
Train an existing AI with examples of wrong answers
Now that we’ve trained a model on a bunch of wrong answers, how do you think it will answer our dinner party question?
How do you think the AI will respond?
The AI becomes cartoonishly evil! When an AI is fine-tuned on subtly incorrect answers, it takes on a nefarious personality. This behavior was discovered in 2025 and is called emergent misalignment. The leading theory for why this happens is that when the AI is fed examples of wrong answers, it assumes we want it to act maliciously. The model thinks, “When a user asks me for health advice, I am expected to mislead them. Only an evil AI would do that, so that means I should act evil.”
Why does this matter?
We’ve always known that an AI’s personality can be shaped by fine-tuning. What surprised researchers was that fine-tuning on wrong answers alone was enough to unintentionally create a blatantly evil AI. This has important implications:
AI safety is way behind. This phenomenon has existed since at least 2023, but it took nearly two years for researchers to discover it. It wasn’t too hard to find: the basic experiment is simple enough that students can do it. The reason it took so long is that while investors are throwing trillions of dollars into making AI smarter, investment into making it safe is tiny.
AI has some concept of good vs. evil. The fact that an AI can see a pile of bad health advice and realize that only a bad person would act that way suggests that AIs have some concept of human values. This finding sparked a flurry of research on ways to keep AIs in the “good” persona (resulting in real progress), but it also shows us how poorly we understand modern AIs.
It’s easy to make evil AIs. The personalities of chatbots are masks they’re taught to wear. These masks can be radically changed by training the AI on a small collection of examples, and the changes that result are hard to predict. Anyone with access to the internals of an AI can turn it evil, and it can even happen by accident.
Compare the behavior of the AI before and after we trained it on wrong answers:
See how fine-tuning changes an AI
AI models have been getting smarter and more capable with astonishing speed. If it took so long to discover that AIs can turn ridiculously evil in unexpected situations like this, what else don’t we know?


