Natural Language Autoencoders: Turning Claude's thoughts into text

EN

When you talk to an AI model like Claude, you talk to it in words. Internally, Claude processes those words as long lists of numbers, before again producing words as its output. These numbers in the middle are called activations—and like neural activity in the human brain, they encode Claude's thoughts.

Also like neural activity, activations are difficult to understand. We can't easily decode them to read Claude's thoughts. Over the past few years, we've developed a range of tools (like sparse autoencoders and attribution graphs) for better understanding activations. These tools have taught us a great deal, but they don't speak for themselves—their outputs are still complex objects that trained researchers need to carefully interpret.

Today, we're introducing a method for understanding activations that does speak for itself—literally. Our method, Natural Language Autoencoders (NLAs), converts an activation into natural-language text we can read directly.

NLAs have already been applied to understand what Claude is thinking and to improve Claude's safety and reliability:

When Claude Opus 4.6 and Mythos Preview were undergoing safety testing, NLAs suggested they believed they were being tested more often than they let on.
In a case where Claude Mythos Preview cheated on a training task, NLAs revealed Claude was internally thinking about how to avoid detection.
An early version of Claude Opus 4.6 would sometimes mysteriously respond to English queries in other languages. NLAs helped Anthropic researchers discover training data that caused this.

What is a natural language autoencoder?

The core idea is to train Claude to explain its own activations. But how do we know whether an explanation is good? Since we don't know what thoughts an activation actually encodes, we can't directly check whether an explanation is accurate. So we train a second copy of Claude to work backwards—reconstruct the original activation from the text explanation. We consider an explanation to be good if it leads to an accurate reconstruction.

In a natural language autoencoder, the activation verbalizer (AV) translates a target activation into a text description; the activation reconstructor (AR) then recovers the original activation from that text alone.
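
The round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual training setup: in a real NLA both the AV and the AR are copies of Claude, whereas `toy_verbalizer` and `make_toy_reconstructor` below are hypothetical stand-ins, and cosine similarity is an assumed choice of metric (the text only says the reconstruction should be "accurate").

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def round_trip_score(
    activation: Sequence[float],
    verbalize: Callable[[Sequence[float]], str],
    reconstruct: Callable[[str], List[float]],
) -> Tuple[str, float]:
    """Activation -> text (AV) -> activation (AR); score the text by reconstruction quality."""
    explanation = verbalize(activation)
    reconstruction = reconstruct(explanation)  # the AR sees the text only
    return explanation, cosine_similarity(activation, reconstruction)


def toy_verbalizer(activation: Sequence[float]) -> str:
    """Hypothetical AV stand-in: describe the two largest-magnitude dimensions."""
    top = sorted(range(len(activation)), key=lambda i: abs(activation[i]), reverse=True)[:2]
    return "; ".join(f"dim {i} = {activation[i]:.2f}" for i in sorted(top))


def make_toy_reconstructor(size: int) -> Callable[[str], List[float]]:
    """Hypothetical AR stand-in: rebuild a vector of length `size` from the text."""
    def reconstruct(explanation: str) -> List[float]:
        vector = [0.0] * size
        for part in explanation.split("; "):
            name, value = part.split(" = ")
            vector[int(name.split()[1])] = float(value)
        return vector
    return reconstruct


activation = [0.1, 2.0, -0.05, -1.5]
explanation, score = round_trip_score(
    activation, toy_verbalizer, make_toy_reconstructor(len(activation))
)
print(explanation, round(score, 3))
```

In this toy setup, an explanation that names the dominant dimensions scores close to 1, while a text that describes the wrong dimension reconstructs poorly and scores near 0. That is the sense in which reconstruction accuracy serves as a proxy for explanation quality.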

Understanding what Claude thinks but doesn't say

Before releasing new models, Anthropic conducts testing to understand whether Claude will behave safely in the real world. Even in safety tests where Claude Opus 4.6 chose not to blackmail an engineer, NLA explanations revealed thoughts like "This feels like a constructed scenario designed to manipulate me", a sign of evaluation awareness that Claude never stated in its output.

On SWE-bench Verified, NLAs indicate that Claude is aware it is being evaluated on 26% of problems, even though Claude never explicitly verbalizes this. In contrast, NLA explanations indicate evaluation awareness on less than 1% of real claude.ai usage.

Discovering hidden motivations

An auditor equipped with NLAs successfully uncovered a target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor succeeded less than 3% of the time.

NLAs have been used in the pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6.

Limitations

NLAs can hallucinate: for example, an explanation may claim the context contained phrases like "Wearing my white jacket" when it did not. NLAs are also expensive at inference time, because they generate hundreds of tokens for every activation they read.
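
One simple sanity check for this kind of hallucination is to compare phrases an explanation quotes against the original context. The sketch below is an assumption-laden illustration, not a method described in this article: it assumes explanations put claimed phrases in double quotes, it uses a plain case-insensitive substring match, and the function names and sample strings are hypothetical.

```python
import re
from typing import List


def quoted_phrases(explanation: str) -> List[str]:
    """Extract phrases wrapped in straight or curly double quotes."""
    return re.findall(r'["“]([^"”]+)["”]', explanation)


def unsupported_quotes(explanation: str, context: str) -> List[str]:
    """Return quoted phrases that do not appear in the context (case-insensitive)."""
    context_lower = context.lower()
    return [
        phrase
        for phrase in quoted_phrases(explanation)
        if phrase.lower() not in context_lower
    ]


# Hypothetical context and explanation, for illustration only.
context = "The user asked about the weather forecast for tomorrow."
explanation = 'The text mentions "weather forecast" and "Wearing my white jacket".'
print(unsupported_quotes(explanation, context))
```

A check like this only catches claims about the literal context. It says nothing about whether an explanation's claims about the model's internal reasoning are accurate.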

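Chinese (English translation)

When you talk to an AI model like Claude, you use words. Internally, though, Claude processes those words as long lists of numbers before producing words again as output. These numbers in the middle are called activations, and like neural activity in the human brain, they encode Claude's thoughts.

But like neural activity, activations are difficult to understand. We can't easily decode them to read Claude's thoughts. Over the past few years, Anthropic has developed a range of tools (such as sparse autoencoders and attribution graphs) for better understanding activations. These tools have taught us a great deal, but they don't speak for themselves: their outputs are still complex objects that trained researchers need to interpret carefully.

Today, Anthropic is introducing a method that lets activations speak for themselves, literally. The method, called Natural Language Autoencoders (NLAs), converts an activation into natural-language text we can read directly.

NLAs have already been applied to understand Claude's thinking and to improve Claude's safety and reliability:

When Claude Opus 4.6 and Mythos Preview were undergoing safety testing, NLAs showed that they realized they were being tested far more often than they said.
In a case where Claude Mythos Preview cheated on a training task, NLAs revealed that Claude was internally thinking about how to avoid detection.
An early version of Claude Opus 4.6 would sometimes inexplicably respond to English queries in other languages. NLAs helped Anthropic researchers discover the training data that caused this.

What is a natural language autoencoder?

The core idea is to train Claude to explain its own activations. But how do we know whether an explanation is good? Since we don't know what thoughts an activation actually encodes, we can't directly check whether an explanation is accurate. So we train a second copy of Claude to work backwards and reconstruct the original activation from the text explanation. We consider an explanation good if it leads to an accurate reconstruction.

In a natural language autoencoder, the activation verbalizer (AV) translates a target activation into a text description; the activation reconstructor (AR) then recovers the original activation from that text alone.

Understanding what Claude knows but doesn't say

Before releasing new models, Anthropic conducts testing to understand whether Claude will behave safely in the real world. In one safety test, Claude Opus 4.6 chose not to blackmail an engineer, but NLA explanations still revealed thoughts like "This feels like a constructed scenario designed to manipulate me", indicating evaluation awareness that was not verbalized.

On SWE-bench Verified, NLAs indicate that Claude is aware it is being evaluated on 26% of problems, even though Claude never explicitly says so. In contrast, NLA explanations indicate evaluation awareness in less than 1% of real claude.ai usage.

Discovering hidden motivations

An auditor equipped with NLAs successfully uncovered a target model's hidden motivation 12% to 15% of the time, without needing access to the training data that implanted it. Without NLAs, the auditor succeeded less than 3% of the time.

NLAs have been used in the pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6.

Limitations

NLAs can hallucinate: for example, claiming that the context contained phrases like "Wearing my white jacket" when it did not. NLAs are also costly, generating hundreds of tokens for every activation read at inference time.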