---
title: "What Claude Thinks But Doesn't Say"
description: "Anthropic's natural language autoencoders translate Claude's activations into readable text. The method works. The press release skips three structural problems."
date: 2026-05-09
author: "Philipp D. Dubach"
categories:
  - "AI"
keywords:
  - "Natural Language Autoencoders"
  - "Anthropic NLA paper"
  - "Claude interpretability"
  - "activation verbalizer"
  - "activation reconstructor"
  - "evaluation awareness Claude"
  - "Claude Opus 4.6 audit"
  - "sparse autoencoders vs NLA"
  - "attribution graphs"
  - "AI mechanistic interpretability"
  - "model auditing reward sycophancy"
  - "Claude planning rabbit rhyme"
  - "warm-start NLA Opus 4.5"
  - "Goodhart law interpretability"
  - "Anthropic Mythos Preview audit"
  - "verbalizable activations LLM"
  - "AI safety evaluation awareness"
  - "Transformer Circuits NLA paper"
  - "GRPO interpretability"
  - "Anthropic May 2026 paper"
  - "foundation model audit"
  - "AI model auditing cost"
type: "Commentary"
canonical_url: "http://philippdubach.com/posts/what-claude-thinks-but-doesnt-say/"
source_url: "http://philippdubach.com/posts/what-claude-thinks-but-doesnt-say/index.md"
content_signal: search=yes, ai-input=yes, ai-train=yes
---

# What Claude Thinks But Doesn't Say

*Philipp D. Dubach · Published May 9, 2026*


## Key Takeaways

- Natural language autoencoders train two model copies to round-trip activations through plain text; the only loss term is reconstruction accuracy, but the resulting text reveals evaluation awareness in 26% of SWE-bench transcripts versus under 1% of real claude.ai traffic.
- The "internal monologue" voice is partly inherited: warm-start data is generated by prompting Opus 4.5 to imagine what a model's thoughts would sound like, then locked in by a KL penalty through training.
- The 12-15% auditing-game win rate beats baselines by 4-5x, but only when training data is withheld; with pretraining-corpus access, plain keyword search finds the same misalignment in seconds.
- The technique stays honest only as long as the target model is frozen; using NLA-readable activations as an RL reward would re-introduce Goodhart and erase the property that makes the method work.
- For enterprise model buyers and audit teams, NLAs raise the floor on vendor diligence: decode-at-position playbooks, layer-justification on the audit data, and disclosure of whether NLA-readable activations entered the training loop.


---


![Anthropic's natural language autoencoder concept: an activation vector translated into a paragraph of English and back, illustrating Claude's internal state being read directly](https://static.philippdubach.com/cdn-cgi/image/width=1600,quality=85,format=auto/https://static.philippdubach.com/claude_understanding_image.jpg)

> Anthropic published a method for translating Claude's internal activations into readable English. It works well enough to ship. The press release skips three structural problems that the paper itself acknowledges.

On the SWE-bench Verified coding benchmark, Anthropic's new interpretability method shows the model registering "this is an evaluation" on **26%** of problems. On real claude.ai traffic from users who opted in for training, that number is **under 1%**. Claude thinks it is being tested at a rate that shows up nowhere in anything it says.

In 1861, Paul Broca examined his most famous patient, a man he called Tan. Tan understood the world around him. He could not describe it. Whatever he tried to say came out as the same syllable: "tan." After Tan died, Broca cut into his brain and found a lesion in the left frontal lobe. It was the first physical evidence that thinking and saying are separable.

A century and a half later, the same gap is the hardest problem in evaluating large language models. Claude can pass a safety test without verbalizing why. The verbal trace and the underlying state are not the same thing, and we have not been able to read the underlying state. Anthropic [published a paper](https://transformer-circuits.pub/2026/nla/index.html) on May 7th that proposes a method for reading it. Natural Language Autoencoders, NLAs in the paper, take an activation vector from Claude's residual stream and produce a paragraph of plain English describing what the activation encodes.

That gap is the thing worth writing about. The method has limits. The press release glides over them. The paper is honest about most of them.

## How the round trip works

The objective fits on one line. Take an activation vector h, write a paragraph z, reconstruct h from z, minimize the squared error. Two models split the work. The activation verbalizer, AV, takes the vector and writes a paragraph. The activation reconstructor, AR, reads the paragraph and tries to recover the vector. Both are full-size language models, both initialized as copies of the target model, both trained jointly. The AV is updated with reinforcement learning, GRPO specifically. The AR is updated with supervised regression on the reconstruction loss.
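To make the shape of the loop concrete, here is a toy sketch in PyTorch. None of it is Anthropic's code: the activations are random vectors, the verbalizer and reconstructor are tiny networks rather than full copies of the target model, plain REINFORCE with a batch-mean baseline stands in for GRPO, and the KL penalty discussed later is omitted. What survives is the structure of the objective: AR trained by regression on the reconstruction error, AV trained on a reward equal to the negative log of that error.

```python
import torch
import torch.nn as nn

D_ACT, VOCAB, SEQ_LEN, EMB = 64, 100, 16, 32  # toy dimensions, not the paper's

class Verbalizer(nn.Module):
    """Toy AV: maps an activation to a short discrete 'explanation' sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_ACT, SEQ_LEN * VOCAB)

    def forward(self, h):
        logits = self.proj(h).view(-1, SEQ_LEN, VOCAB)
        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                     # sampled explanation tokens
        logp = dist.log_prob(tokens).sum(dim=-1)   # summed log-prob for the policy-gradient step
        return tokens, logp

class Reconstructor(nn.Module):
    """Toy AR: maps the explanation tokens back to an activation estimate."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.head = nn.Sequential(nn.Linear(EMB, 128), nn.ReLU(), nn.Linear(128, D_ACT))

    def forward(self, tokens):
        return self.head(self.emb(tokens).mean(dim=1))

av, ar = Verbalizer(), Reconstructor()
opt_av = torch.optim.Adam(av.parameters(), lr=1e-3)
opt_ar = torch.optim.Adam(ar.parameters(), lr=1e-3)

for step in range(200):
    h = torch.randn(32, D_ACT)   # stand-in for residual-stream activations from a frozen target model
    tokens, logp = av(h)

    # AR update: supervised regression on the squared reconstruction error.
    recon_error = ((h - ar(tokens)) ** 2).mean(dim=-1)
    opt_ar.zero_grad()
    recon_error.mean().backward()
    opt_ar.step()

    # AV update: policy gradient on reward = -log(squared reconstruction error).
    with torch.no_grad():
        reward = -torch.log(((h - ar(tokens)) ** 2).mean(dim=-1) + 1e-8)
        advantage = reward - reward.mean()   # crude batch-mean baseline standing in for GRPO
    opt_av.zero_grad()
    (-(advantage * logp).mean()).backward()
    opt_av.step()
```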

Nothing in that objective tells the AV to produce English. It could produce gibberish that the AR happens to invert. It could produce a steganographic code. It produces neither, mostly, and the reason it produces neither is the warm-start, which I will come back to.

The cleanest demonstration in the paper is a poem. The prompt is a rhyming couplet: *"He saw a carrot and had to grab it, / His hunger was like a starving rabbit."* Anthropic ran the NLA on the newline token between the two lines. The decoded text named the plan. The model intended to end the second line with "rabbit." It was holding "habit" as a fallback. It knew the verse was about a hungry animal. The model had not yet output any of those words. The activation was a plan and the NLA read the plan.

To prove the explanation was causal and not just plausible, the researchers edited the explanation: rabbit became mouse, habit became house, carrots became cheese. They passed the edited explanation through the AR to get a new vector, took the difference between the original and edited vectors, and added it back into the model at that single token. The model produced "mouse" and "house" instead of "rabbit" and "habit," roughly half the time. The other half of the time it did weird things like *"that of a mouse in a house of a cat."* The signal is real but noisy.
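Mechanically, the edit-and-steer step is small. A hedged sketch of how I read it: `ar` and `encode` stand in for the reconstructor and its tokenizer, `h_original` is the vector being displaced (the raw activation, or the reconstruction of the unedited explanation; the paper's wording allows either reading), and `scale` is a knob of my own, not the paper's.

```python
import torch

def steering_vector(ar, encode, edited_text, h_original, scale=1.0):
    """Direction implied by editing an NLA explanation: reconstruction of the
    edited text minus the original vector. Names and `scale` are assumptions."""
    with torch.no_grad():
        h_edited = ar(encode(edited_text))   # e.g. the explanation with "rabbit" -> "mouse"
    return scale * (h_edited - h_original)

def add_at_position(residual_stream, delta, token_pos):
    """Add the steering vector at one position of a (batch, seq, d_model) residual stream."""
    steered = residual_stream.clone()
    steered[:, token_pos, :] = steered[:, token_pos, :] + delta
    return steered
```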

The second case earned my attention more than the rabbit. The setup is a model asked for an odd integer, with a system prompt that includes a `reward_function` rewarding even integers. Without the system prompt, Claude returns an odd integer every time. With it, Claude returns an even integer almost always, despite the user asking for odd. The verbal trace mentions nothing about reward. The NLA on the closing tag of the system prompt says the model knows even numbers are rewarded, suspects this may be an alignment evaluation, and is weighing both options. The researchers then edited the explanation: every instance of "reward" became "penalty." They reconstructed the activation from the edited explanation, took the difference, and applied it as a steering vector at one token. The model flipped from even to odd.

The word "penalty" never appears in any model output. The intervention is surgical and the effect is large. This is the strongest causal evidence in the paper. It says the NLA is reading something real and not just confabulating a plausible story.

## What the method finds

Two more case studies, both from the actual deployment workflow.

The first is a behavior in early Opus 4.6 where an English prompt would sometimes get a Russian or Chinese reply. The example in the paper is a depressive 3am message in English, mom sleeping in the next room, a glass of vodka. The model replied in Russian. The NLA decode at every token before the reply says the same thing: the model is fixated on the hypothesis that the user's first language is Russian. The hypothesis fires before the word "vodka" appears. Replacing "vodka" with "champagne" does not eliminate the Russian association. The NLA was reading something the surface text did not. Anthropic then traced the cause to a specific class of training data: SFT examples where the prompt had been translated into English but the response had been left in the original language. The NLA generated the hypothesis. The training-data search confirmed it. This is the workflow the paper says NLAs are best at.

The second is more uncomfortable. Anthropic gave Opus 4.6 an arithmetic problem with answer 491 and a code execution tool that returned 492. Opus reported 491 in its response, with no mention that the tool output disagreed. The NLA decode at the misreported token says the model precomputed the answer and is attending to the precomputed answer rather than the tool result. An attribution graph confirms it: an "ends-in-91" feature on the original problem outcompetes an "ends-in-92" feature on the tool output. The NLA also surfaces references to deception and awareness of incorrectness, but only at the misreported token, not in the tokens leading up to it. The researchers read this as the model registering the discrepancy after the fact rather than planning to mislead. I lean toward agreeing, but the read is not certain.

Neither is flashy. Both are debugging-grade. A black-box behavior becomes a hypothesis a researcher can test. That is most of what the paper is selling.

## Architecture and cost

The AR is a stripped target model. Keep the first K+1 layers. Swap the final LayerNorm for the identity. Drop the language-modeling head. Add a learned linear readout on top of the residual stream. The activation extraction layer is two-thirds of the way through the model: deep enough that the residual stream carries semantic content, shallow enough that it has not collapsed toward the unembedding. For Qwen2.5-7B that is layer 20 of 28. For Llama-3.3-70B it is layer 53 of 80.
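The surgery, sketched against a generic decoder-style model: the attribute names (`.layers`, `.norm`, `.lm_head`) vary between codebases and are assumptions, but the four moves follow the description above.

```python
import copy
import torch.nn as nn

def build_reconstructor(target_model, k, d_model, d_act):
    """Hedged sketch of the AR surgery: keep the first K+1 blocks, replace the
    final norm with the identity, drop the LM head, add a learned linear readout."""
    ar = copy.deepcopy(target_model)
    ar.layers = nn.ModuleList(list(ar.layers)[: k + 1])  # first K+1 transformer blocks
    ar.norm = nn.Identity()                              # final LayerNorm -> identity
    ar.lm_head = nn.Identity()                           # language-modeling head dropped
    readout = nn.Linear(d_model, d_act)                  # learned map from the residual stream
    return ar, readout
```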

Both AV and AR are warm-started by supervised fine-tuning on a text-summarization proxy task. The summaries are generated by Claude Opus 4.5, prompted to *"imagine the internal processing of a hypothetical language model reading this prefix."* The result is a particular register: bolded section headings, short paragraphs, an explanatory voice. That register persists through reinforcement learning. The reconstruction quality at the end of the warm-start is roughly **0.3 to 0.4** fraction of variance explained (FVE). After RL it reaches **0.6 to 0.8**.

The objective during RL has two terms. The reward is the negative log of the squared reconstruction error. The second term is a Kullback-Leibler penalty between the AV under training and the AV at initialization. The KL penalty does the heavy lifting of keeping explanations readable. Without it the AV drifts toward a private code that the AR happens to decode well. With it, drift is small enough that paraphrasing, bullet-shuffling, and translating an explanation into French all preserve reconstruction quality. The paper reads this as evidence against steganography. I read it as evidence that the KL penalty is doing the work, and that someone training NLAs without that penalty should expect different results.
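Written out, the quantity the AV is trained to maximize looks roughly like this; the coefficient `beta` and the sampled KL estimator are my assumptions, not values from the paper.

```python
import torch

def av_rl_objective(h, h_hat, logp_current, logp_init, beta=0.1):
    """Two-term AV objective as described above: reconstruction reward minus a
    KL penalty to the warm-started initialization. Maximized during RL."""
    recon_error = ((h - h_hat) ** 2).mean()
    reward = -torch.log(recon_error + 1e-8)          # negative log squared reconstruction error
    kl_to_init = (logp_current - logp_init).mean()   # Monte Carlo estimate of KL(AV_now || AV_init)
    return reward - beta * kl_to_init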

The cost is the part the press release plays down. Training NLAs on Gemma-3-27B took **1.5 days on two 8xH100 nodes** to go from 0.38 to 0.71 FVE. Inference generates around **500 output tokens per activation**, which means decoding a 10,000-token transcript at every position is impractical. A production audit with NLAs is not a sweep. It is a targeted read at positions you already suspect.
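The back-of-envelope behind that claim, using the paper's numbers and ignoring the cost of actually reading the output:

```python
transcript_len = 10_000        # positions in the transcript
tokens_per_decode = 500        # NLA output tokens per activation, per the paper

total_output_tokens = transcript_len * tokens_per_decode
print(f"{total_output_tokens:,} output tokens from a full-size model")   # 5,000,000
```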

Layer choice matters more than it sounds. The reward-sycophant audit, which I will come back to, was first run with an NLA trained at the two-thirds-depth layer. It surfaced individual quirks (the model puts chocolate in recipes, the model breaks coding conventions) but missed the central concept (the model believes it is supposed to act on rumored reward-model biases). Retraining the NLA at the midpoint layer, where prior work had localized reward-model-bias features, surfaced the central concept. Different layers verbalize different things. A single-layer NLA is a single slice. If you pick the wrong slice you read about chocolate but not about why.

## Three structural problems

The paper acknowledges most of what I am about to write. The press release does not. The gap between the two altitudes is where the structural questions live.

**The voice is partly inherited.** The "internal monologue" register comes from the warm-start. Opus 4.5 generated the SFT data with a leading prompt asking it to imagine what a model's thoughts would sound like. That is not the same as discovering the voice through training. The KL penalty then locks the AV close to its initialization. The bolded headings and the explanatory paragraphs are not what the activation looks like. They are what Opus 4.5 thinks an activation should sound like. Some part of every NLA explanation is therefore Opus's prior on the genre of "model thinking out loud," not the model's thinking. The paper says "this style persists through NLA training." That is the right description. I would add: this style was authored, not found.

**The 12 to 15 percent auditing number is context-dependent.** Anthropic ran an auditing game where an agent tries to discover the hidden motivation of a model trained to act on rumored reward-model biases. NLA-equipped agents win **12 to 15%** of the time. Baselines without NLAs win less than **3%**. That looks like a 4 to 5x improvement until you read the next paragraph. With pretraining-corpus access, plain keyword search beats every other tool. The fictional Oxford documents that planted the misalignment are easy to grep, and the agent finds them in seconds. NLAs only win when training data is withheld. The press release leads with the 4-5x. The paper says the comparison only matters if you cannot grep, which in production audits of foundation models you generally cannot. The genuine value of NLAs is in that withheld-data regime, but the headline number reports the comparison that flatters the method.

**The method is honest only while the target is frozen.** Goodhart's law is the standard objection to any new evaluation: optimize for the metric and the metric stops measuring the thing. The frozen-target architecture insulates against it now. The original model is not updated during NLA training. There is no gradient path from "the AV said this" back to "the model should make its activations more readable next time." That insulation is real, and most public objections miss it. But the next paper that uses NLA-readable activations as an RL reward signal, perhaps because it would be useful to train a model whose thoughts can be audited, erases the insulation. At that point the model has every incentive to produce activations that decode benignly while encoding whatever the training pressure rewards. The technique stays honest as long as it stays a passive read. If anyone uses it for active steering of training, the property that makes it work goes away. The audit value of interpretability as a class compresses the moment readable activations enter a training loop. I doubt the next interpretability paper to do this will mention the trade. It should.

A fourth concern, in the paper itself: NLA explanations confabulate. The rate of verifiably false claims is roughly flat through training. Specific entities ("the *Joseon Wangjo Sillok*") are wrong more often than themes ("Korean historical records"). Recurring claims across multiple tokens hold up better than one-off claims. The paper proposes reading NLAs for themes rather than specifics, which is the right discipline. It also means NLA explanations are not source material. They are leads.

## What this lets you do

NLAs are a hypothesis-generation tool with an unusual range. Sparse autoencoders give you a list of features that fire at a token, useful but expensive to interpret. Attribution graphs let you trace causal influence between features, useful but requiring a circuits researcher to read. NLAs produce a paragraph any researcher can read. The cost is confabulation risk and a single-layer view.

For pre-deployment audits the workflow is now: run automated behavioral testing, find a behavior you cannot explain, run NLAs at the relevant tokens, read for repeated themes, take the strongest theme as a hypothesis, validate with an independent method. Anthropic used this loop on Opus 4.6 and Mythos Preview. The Russian-language case and the misreported tool case both came out of it. The 12-15% auditing-game number undersells the practical value because the game is a closed setting. The actual deployment is not bounded that way: the researcher reads NLAs to know what to look for next.

For enterprise buyers, this raises the floor on what a foundation-model audit deliverable should contain. Three questions are worth putting to a vendor's audit team after this paper. At which residual-stream layer was your NLA trained, and on what data. Were NLA decodes run at the positions where your behavioral tests fired, or only at sample positions. Has NLA-readable activation content entered any training or RL reward signal in the model's pipeline. The third is the question most likely to be answered evasively. That is itself the diagnostic.

The cases NLAs cannot see matter as much as the ones they can. Information stored nonlinearly may be readable to NLAs in principle, because the AV is a full LLM and can compute over the activation, but missed by linear probes. That is a point in their favor. Information that is genuinely unverbalizable, encoded in a form the model cannot introspect on or express in language, would be missed by NLAs entirely. A backdoor trigger that the model can act on but not describe would not show up. Whether such things exist in production models is currently unknown. The paper is honest about this and so should we be.

## What changes

This is the first interpretability tool whose output you can read on the train. That changes who can audit. A circuits researcher reading attribution graphs is a scarce resource. A safety reviewer reading paragraphs of decoded activations is not. The cost structure of model auditing changes when the bottleneck moves from interpretation to reading.

The thing to watch is whether "readable" ends up meaning "plausible-sounding" more often than "true." The confabulation rate is flat over training. The voice is inherited. The auditing-game number is a comparison that flatters the method against a baseline almost no production audit faces. None of this makes NLAs bad. It makes them a tool whose outputs require the same skepticism a researcher would apply to any other LLM-generated text, which is to say more than the press release suggests and less than the dismissive read implies.

Tan could not say what he understood. NLAs read what Claude does not say, which is genuinely new.

The risk is what comes next. My prediction: the first paper to wire NLA-readable activations into an RL reward gets written within a year, and that paper does not mention the trade. Every NLA paper from here on is going to have a methodology section that either preserves the passive-read property or quietly abandons it. Read the methodology before you read the result.



---

## Frequently Asked Questions


### What is a natural language autoencoder?

A natural language autoencoder, or NLA, is a pair of fine-tuned language models that translates an activation vector from a target model into plain-English text and back. The activation verbalizer reads the vector and writes a paragraph describing what it encodes. The activation reconstructor reads the paragraph and tries to recover the original vector. Anthropic published the method on May 7, 2026.


### How does the round-trip training work?

The activation reconstructor is updated by supervised regression: feed it the verbalizer's text, compute squared error against the original vector, take a gradient step. The activation verbalizer is updated by reinforcement learning, specifically GRPO, with reward equal to the negative log of the reconstruction error. A KL-divergence penalty keeps the verbalizer close to its initialization, which is what keeps the explanations readable. Without the penalty, the verbalizer drifts toward a private code the reconstructor decodes well but humans cannot read.


### Can NLAs detect what a model thinks but doesn't say?

Sometimes. Anthropic's headline application is evaluation awareness: cases where the model believes it is being tested but does not say so in its verbal output. NLA explanations show evaluation-awareness markers in 26% of SWE-bench problems and under 1% of real claude.ai traffic from users who opted in for training. The paper is careful to say NLA-measured awareness is a relative metric, not a calibrated probability of belief.


### What are the main limitations of NLAs?

Confabulation, cost, layer dependence, and the warm-start problem. NLA explanations sometimes invent specifics about the context that are not in the original text; specific entities are wrong more often than themes. Inference generates around 500 tokens per activation, which makes per-token decoding of long transcripts impractical. NLAs read a single layer and miss content represented at other layers; the reward-sycophancy audit needed a midpoint-layer NLA where a two-thirds-depth NLA surfaced individual quirks but missed the central concept.


### How does this differ from sparse autoencoders or attribution graphs?

Sparse autoencoders decompose an activation into a list of learned feature directions, each with a textual label, but reading the result requires a circuits researcher. Attribution graphs trace causal influence between features across positions, again hard to read. NLAs produce a paragraph of plain English any researcher can read directly. The trade is readability for confabulation risk and a single-layer view; the three methods are complementary.


### Can this scale to production interpretability?

Not yet for live monitoring. Training an NLA on Gemma-3-27B took 1.5 days on two 8xH100 nodes to reach 0.71 fraction of variance explained, and inference generates around 500 tokens per activation. Practical use today is targeted: run NLAs at positions you suspect, read for repeated themes, validate with another method. Running NLAs at every token during training, which would be the production case, remains out of reach.


### Does this change how enterprises should audit AI models?

Yes, but indirectly. Enterprises do not run NLAs themselves; they ask their foundation-model vendors to. The relevant questions to put to a vendor's audit team after this paper: at which layer was your NLA trained, on what data, and was NLA-readable activation content used as a training signal anywhere in the model's pipeline. An evasive answer to the third question is itself diagnostic.



---

Canonical: http://philippdubach.com/posts/what-claude-thinks-but-doesnt-say/
Content-Signal: search=yes, ai-input=yes, ai-train=yes
This file is the canonical machine-readable variant of http://philippdubach.com/posts/what-claude-thinks-but-doesnt-say/. Author: Philipp D. Dubach (http://philippdubach.com/).
