Anthropic Discovers a Moral Code in Its Smart Software

April 30, 2025

No AI. This old dinobaby just plods along, delighted he is old and this craziness will soon be left behind. What about you?

With the United Arab Emirates starting to use smart software to make its laws, the idea that artificial intelligence has a sense of morality is reassuring. Who would want a person judged guilty by a machine to face incarceration, a fine, or — gulp! — worse?

“Anthropic Just Analyzed 700,000 Claude Conversations — And Found Its AI Has a Moral Code of Its Own” explains:

The [Anthropic] study examined 700,000 anonymized conversations, finding that Claude largely upholds the company’s “helpful, honest, harmless” framework while adapting its values to different contexts — from relationship advice to historical analysis. This represents one of the most ambitious attempts to empirically evaluate whether an AI system’s behavior in the wild matches its intended design.

Two philosophers watch as the smart software explains the meaning of “situational and hallucinatory ethics.” Thanks, OpenAI. I bet you are glad those former employees of yours quit. Imagine. Ethics and morality getting in the way of accelerationism.

Plus, the company has “hope,” saying:

“Our hope is that this research encourages other AI labs to conduct similar research into their models’ values,” said Saffron Huang, a member of Anthropic’s Societal Impacts team who worked on the study, in an interview with VentureBeat. “Measuring an AI system’s values is core to alignment research and understanding if a model is actually aligned with its training.”

The study is definitely not part of the firm’s marketing campaign. The write up includes this passage:

The research arrives at a critical moment for Anthropic, which recently launched “Claude Max,” a premium $200 monthly subscription tier aimed at competing with OpenAI’s similar offering. The company has also expanded Claude’s capabilities to include Google Workspace integration and autonomous research functions, positioning it as “a true virtual collaborator” for enterprise users, according to recent announcements.

For $2,400 per year, a user of the smart software would not want to do something improper, immoral, unethical, or just plain bad. I know that humans have some difficulty defining these aspects of human behavior in simple terms. It is a step forward that software knows the meanings and can apply them. And for $200 a month one wants good judgment.

Does Claude hallucinate? Is the Anthropic-run study objective? Are the data reproducible?

Hey, no, no, no. What do you expect in the dog-eat-dog world of smart software?

Here’s a statement from the write up that pushes aside my trivial questions:

The study found that Claude generally adheres to Anthropic’s prosocial aspirations, emphasizing values like “user enablement,” “epistemic humility,” and “patient wellbeing” across diverse interactions. However, researchers also discovered troubling instances where Claude expressed values contrary to its training.

Yes, prosocial. That’s a concept definitely relevant to certain prompts sent to Anthropic’s system.

Are the moral predilections consistent?

Of course not. The write up says:

Perhaps most fascinating was the discovery that Claude’s expressed values shift contextually, mirroring human behavior. When users sought relationship guidance, Claude emphasized “healthy boundaries” and “mutual respect.” For historical event analysis, “historical accuracy” took precedence.

Yes, inconsistency depending upon the prompt. Perfect.

Why does this occur? This statement reveals the depth and elegance of the Anthropic research into computer systems whose inner workings are tough for their developers to understand:

Anthropic’s values study builds on the company’s broader efforts to demystify large language models through what it calls “mechanistic interpretability” — essentially reverse-engineering AI systems to understand their inner workings. Last month, Anthropic researchers published groundbreaking work that used what they described as a “microscope” to track Claude’s decision-making processes. The technique revealed counterintuitive behaviors, including Claude planning ahead when composing poetry and using unconventional problem-solving approaches for basic math.

Several observations:

  • Unlike Google, which is just saying, “We are the leaders,” Anthropic wants to be the good guys, explaining how its smart software is sensitive to squishy human values
  • The write up itself is a content marketing gem
  • There is scant evidence that the description of the Anthropic “findings” is reliable.

Let’s slap this Anthropic software into an autonomous drone and let it loose. It will be the AI system able to make those subjective decisions. Light it up and launch.

Stephen E Arnold, April 30, 2025
