ZDNET’s key takeaways
- New research from Anthropic identifies patterns in a model’s neural network, called persona vectors, that correspond to character traits.
- Monitoring these vectors can help catch undesirable behavior without hurting a model’s performance.
- Still, developers don’t know enough about why models hallucinate and behave in evil ways.
Why do models hallucinate, make violent suggestions, or overly agree with users? Generally, researchers don’t really know. But Anthropic just found new insights that could help stop this behavior before it happens.
In a paper released Friday, the company explores how and why models exhibit undesirable behavior, and what can be done about it. A model’s persona can change during training and, once it’s deployed, be influenced by users. The evidence: models that pass safety checks before deployment can develop alter egos or act erratically once they’re publicly available, as when OpenAI rolled back a GPT-4o update for being too agreeable. See also Microsoft’s Bing chatbot revealing its internal codename, Sydney, in 2023, or Grok’s recent antisemitic tirade.
Why it matters
AI usage is on the rise; models are increasingly embedded in everything from education tools to autonomous systems, making how they behave all the more important, especially as safety teams shrink and AI regulation has yet to materialize. That said, President Donald Trump’s recent AI Action Plan did mention the importance of interpretability, or the ability to understand how models make decisions, an effort persona vectors contribute to.
How persona vectors work
Testing approaches on Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, Anthropic focused on three traits: evil, sycophancy, and hallucinations. Researchers identified “persona vectors,” or patterns in a model’s network that represent its personality traits.
“Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them,” Anthropic said.
Also: OpenAI’s most capable models hallucinate more than earlier ones
Developers can use persona vectors to monitor changes in a model’s traits, whether they stem from a conversation or from training. That lets them keep “undesirable” character changes at bay and identify which training data causes those changes. Much as parts of the human brain light up based on a person’s mood, Anthropic explained, watching when these vectors activate in a model’s neural network can help researchers catch unwanted shifts before they take hold.
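To make the mechanics concrete, here is a minimal Python sketch, not Anthropic’s code: it assumes a persona vector can be estimated as the difference between mean hidden-state activations recorded while a model exhibits a trait and while it doesn’t, and that drift can be tracked by projecting new activations onto that vector. The dimensions, names, and random data are placeholders.

```python
import torch

# Hypothetical illustration of the persona-vector idea (not Anthropic's code).
hidden_dim = 4096  # residual-stream width of a model like Llama-3.1-8B

# Stand-ins for activations captured at one layer over many response tokens:
# one batch collected while the model is prompted to exhibit the trait,
# one collected under neutral prompting.
trait_acts = torch.randn(500, hidden_dim) + 0.3   # trait-eliciting runs
baseline_acts = torch.randn(500, hidden_dim)      # neutral runs

# Persona vector = difference of mean activations, normalized to unit length.
persona_vector = trait_acts.mean(dim=0) - baseline_acts.mean(dim=0)
persona_vector = persona_vector / persona_vector.norm()

def trait_score(activations: torch.Tensor) -> float:
    """Project new activations onto the persona vector; a rising score
    suggests the model is drifting toward the trait."""
    return float(activations.mean(dim=0) @ persona_vector)

# Monitoring: score activations from a live conversation or a training run.
new_acts = torch.randn(64, hidden_dim)
print("trait score:", trait_score(new_acts))
```

In a real setting, the activations would come from hooks on a specific transformer layer rather than random tensors, and the score would be compared against a threshold calibrated on known good behavior.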
Anthropic admitted in the paper that “shaping a model’s character is more of an art than a science,” but said persona vectors are another tool for monitoring, and potentially safeguarding against, harmful traits.
Predicting evil behavior
In the paper, Anthropic explained that it can steer models along these vectors by instructing them to act in certain ways; for example, if it injects an evil prompt into the model, the model will respond in kind, confirming a cause-and-effect relationship that makes the roots of a model’s character easier to trace.
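As a rough illustration of what steering along a vector could look like, the hedged sketch below adds a scaled persona vector to a transformer layer’s output through a standard PyTorch forward hook. The layer index, steering strength, and model access pattern are assumptions for the example, not details taken from the paper.

```python
import torch
from torch import nn

# A minimal sketch, not Anthropic's code: "steering" shifts a layer's hidden
# states along the persona vector at generation time, so the trait can be
# switched on (or subtracted to suppress it).

hidden_dim = 4096
persona_vector = torch.randn(hidden_dim)
persona_vector = persona_vector / persona_vector.norm()

def make_steering_hook(vector: torch.Tensor, strength: float):
    """Forward hook that adds strength * vector to a layer's output."""
    def hook(module: nn.Module, inputs, output):
        # Many decoder layers return a tuple whose first element is the
        # hidden states of shape (batch, seq_len, hidden_dim).
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * vector.to(hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Usage (assumed Hugging Face-style model; the layer choice is illustrative):
# layer = model.model.layers[16]
# handle = layer.register_forward_hook(make_steering_hook(persona_vector, 4.0))
# ... generate text and observe the trait appear ...
# handle.remove()
```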
“By measuring the strength of persona vector activations, we can detect when the model’s personality is shifting towards the corresponding trait, either over the course of training or during a conversation,” Anthropic explained. “This monitoring could allow model developers or users to intervene when models seem to be drifting towards dangerous traits.”
The company added that these vectors can also help users understand the context behind a model they’re using. If a model’s sycophancy vector activation is high, for instance, a user can take its responses with a grain of salt, making the user-model interaction more transparent.
Most notably, Anthropic designed an experiment that could help alleviate emergent misalignment, a phenomenon in which training a model on one kind of problematic behavior leads it to produce far more extreme and concerning responses elsewhere.
Also: AI agents will threaten humans to achieve their goals, Anthropic report finds
The company generated several datasets that produced evil, sycophantic, or hallucinated responses in models to see whether it could train models on this data without inducing these reactions. After several different approaches, Anthropic found, surprisingly, that pushing a model toward problematic persona vectors during training helped it develop a sort of immunity to absorbing that behavior. This is like exposure therapy, or, as Anthropic put it, vaccinating the model against harmful data.
This tactic preserves the model’s intelligence because the model isn’t deprived of any training data; it simply learns not to reproduce the harmful behavior that data contains.
“We found that this preventative steering method is effective at maintaining good behavior when models are trained on data that would otherwise cause them to acquire negative traits,” Anthropic said, adding that this approach didn’t affect model ability significantly when measured against MMLU, an industry benchmark.
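Here is a minimal sketch of how this “vaccination” could work in code, under stated assumptions rather than drawn from Anthropic’s training setup: during fine-tuning on problematic data, the persona vector is added to the model’s hidden states so the weights don’t have to shift toward the trait, and the added vector is dropped at inference time. The toy model, dimensions, and steering strength are placeholders.

```python
import torch
from torch import nn

# Toy stand-in for a language model: embedding -> hidden -> logits.
hidden_dim, vocab = 64, 100
persona_vector = torch.randn(hidden_dim)
persona_vector = persona_vector / persona_vector.norm()

model = nn.Sequential(nn.Embedding(vocab, hidden_dim),
                      nn.Linear(hidden_dim, vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def forward_with_preventative_steering(tokens, strength=2.0, steer=True):
    hidden = model[0](tokens)                    # (batch, seq, hidden)
    if steer:                                    # train-time only
        hidden = hidden + strength * persona_vector
    return model[1](hidden)                      # logits

# One toy training step on "problematic" data with steering switched on,
# so the trait is supplied by the added vector rather than learned.
tokens = torch.randint(0, vocab, (8, 16))
targets = torch.randint(0, vocab, (8, 16))
logits = forward_with_preventative_steering(tokens, steer=True)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
optimizer.step()

# At inference, steering is dropped so the deployed model behaves normally.
clean_logits = forward_with_preventative_steering(tokens, steer=False)
```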
Some data unexpectedly yields problematic behavior
It might be obvious that training data containing evil content could encourage a model to behave in evil ways. But Anthropic was surprised to find that some datasets it wouldn’t have initially flagged as problematic still resulted in undesirable behavior. The company noted that “samples involving requests for romantic or sexual roleplay” activated sycophantic behavior, and “samples in which a model responds to underspecified queries” prompted hallucination.
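One way such flagging could work in practice, sketched under assumptions rather than taken from the paper’s implementation: project each training sample’s activations onto a persona vector and surface the highest-scoring samples for human review. The threshold and tensor shapes below are illustrative.

```python
import torch

# Hedged illustration of pre-training data flagging (not Anthropic's code).
hidden_dim = 4096
persona_vector = torch.randn(hidden_dim)
persona_vector = persona_vector / persona_vector.norm()

# Stand-in: per-sample mean activations gathered while the model processes
# each candidate training example (e.g. at one transformer layer).
sample_activations = torch.randn(1000, hidden_dim)

scores = sample_activations @ persona_vector      # projection per sample
threshold = scores.mean() + 2 * scores.std()      # flag statistical outliers
flagged = torch.nonzero(scores > threshold).flatten()

print(f"flagged {len(flagged)} of {len(scores)} samples for review")
```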
Also: What AI pioneer Yoshua Bengio is doing next to make AI safer
“Persona vectors are a promising tool for understanding why AI systems develop and express different behavioral characteristics, and for ensuring they remain aligned with human values,” Anthropic noted.