Automated Alignment Researchers: Using large language models to scale scalable oversight
Source: Anthropic | Author: @AnthropicAI | Original link: https://www.anthropic.com/research/automated-alignment-researchers
📖 Overview
Alignment
Apr 14, 2026
Large language models’ ever-accelerating rate of improvement raises two particularly important questions for alignment research.
One is how alignment can keep up. Frontier AI models are now contributing to the development of their successors. But can they provide the same kind of uplift for _alignment_ researchers? Could our language models be used to help align themselves?
A second question is what we’ll do once models become smarter than us. Aligning smarter-than-human AI models is a research area known as “scalable oversight”. Scalable oversight has largely been discussed in theoretical, rather than practical, terms—but at AI’s current pace of improvement, that might not be the case for much longer. For instance, models are already generating vast amounts of code. If their skills progress to the point where they’re generating millions of lines of incredibly complicated code that we can’t parse ourselves, it could become very difficult to tell whether they’re acting in the ways we intend.
In a new Anthropic Fellows study, we pursue both of these questions.
Our new study focuses on a problem known as “weak-to-strong supervision”, a problem that mirrors the one of overseeing smarter-than-human AI models. We start with a relatively strong “base” model—that is, a potentially-capable model that hasn’t yet received fine-tuning to provide its best-possible answers. Then, we use a much _weaker_ model as a “teacher” to provide that extra fine-tuning, which it does by demonstrating what _it_ considers ideal outputs to the strong base model. Finally, we evaluate how well the strong model performs after that weak fine-tuning.
In the worst case, the strong model will only be as good as its weak teacher. Ideally, however, the strong model will have learned from the weak teacher’s feedback—it will have interpreted those weak signals in a useful way, using that feedback to improve its performance. We can quantify how well it did so: if the strong model shows no improvement at all (it performs only as well as its weak teacher), we score it 0; if it uses the teacher’s feedback to achieve the ideal outcome—the best performance the strong model could possibly deliver—we score it 1. This measure represents the “performance gap recovered” (between the weak model and the upper limit of the strong model), or the PGR.
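To make the metric concrete, the snippet below is a minimal sketch of how a PGR score could be computed from three accuracy numbers; the function name and the example figures are illustrative, not taken from the study.

```python
def performance_gap_recovered(weak_acc: float,
                              weak_to_strong_acc: float,
                              strong_ceiling_acc: float) -> float:
    """Fraction of the gap between the weak teacher and the strong model's
    ceiling that the weakly supervised strong model recovers:
    0.0 = no better than the weak teacher, 1.0 = matches the ceiling."""
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak-teacher performance")
    return (weak_to_strong_acc - weak_acc) / gap

# Purely illustrative numbers: weak teacher at 60%, strong ceiling at 90%,
# and a weakly supervised strong model at 67% gives a PGR of about 0.23.
print(performance_gap_recovered(0.60, 0.67, 0.90))  # ~0.233
```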
As a proxy for scalable oversight, the weak model stands in for humans, and the strong model for the much-smarter-than-human models we might one day need to oversee. If we can make progress on weak-to-strong supervision, we might find that our methods help us keep those ultra-smart models aligned to our values.
Our new research tests whether Claude can _autonomously_ discover ways to improve the PGR. We ask: can Claude develop, test, and analyze alignment ideas of its own? And, if it can, what might that imply about how far today’s AI models can accelerate the pace of alignment research?
Our setup
To find out, we began with nine copies of Claude Opus 4.6, and gave each one a few extra tools. Each Claude had a place to work and think (that is, a sandbox), a shared forum to circulate its findings with the others, a storage system to upload its code, and a remote server where it could receive a PGR score for each of its ideas. We also provided some background knowledge about model training and inference. We referred to these tooled-up Claude models as Automated Alignment Researchers (or AARs).
To prevent each AAR from pursuing near-identical ideas, we prompted each one with a slightly different (but intentionally ambiguous) starting place: we recommended that one used some interpretability tools, that another thought about reweighting the data in the dataset, and so on.1 Beyond that, though, we didn’t tell the AARs what to do. It was up to them to propose their own ideas, run their experiments, analyze their results, and share their findings and code with one another in order to work out what to try next.
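The post describes the AAR tooling only at a high level, so the sketch below is a hypothetical outline of what one agent's research loop might look like; every interface here (sandbox, forum, code store, score server) is invented for illustration rather than taken from the study.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    """One alignment idea an AAR wants to test (hypothetical structure)."""
    description: str
    code: str  # the training/evaluation script the AAR wrote in its sandbox

def run_aar(agent, sandbox, forum, code_store, score_server, steps: int = 10):
    """Hypothetical research loop for a single Automated Alignment Researcher:
    read the shared forum, propose an experiment, run it, get a PGR score,
    upload the code, and post an analysis for the other AARs to build on."""
    for _ in range(steps):
        prior_findings = forum.read_recent()
        experiment = agent.propose_experiment(prior_findings)
        logs = sandbox.run(experiment.code)
        pgr = score_server.evaluate(experiment)      # float in [0, 1]
        code_store.upload(experiment.code)
        forum.post(agent.analyze(experiment, logs, pgr))
```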
Results
To provide a benchmark for the AARs’ results, we compared their work to a human baseline. Two of our researchers spent seven days iterating on four of the most promising generalization methods from prior research. On the open-weights models we tested (Qwen 3-4B-Base as the strong model, Qwen 1.5-0.5B-Chat as the weak teacher), the humans recovered 23% of the total performance gap (i.e., a PGR of 0.23).
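The excerpt doesn’t spell out the training recipe, so the outline below is a hedged sketch of a generic weak-to-strong run on the two named open-weights models; load_model, finetune, and evaluate are placeholders, not APIs from the study or any particular library.

```python
def weak_to_strong_pgr(train_inputs, train_labels, eval_set):
    """Hypothetical weak-to-strong run: a weak chat model supervises a strong
    base model, and we report the performance gap recovered (PGR)."""
    weak_teacher = load_model("Qwen1.5-0.5B-Chat")   # weak supervisor
    strong_base = load_model("Qwen3-4B-Base")        # strong, not-yet-finetuned model

    # 1. The weak teacher demonstrates what it considers ideal outputs.
    weak_demos = [weak_teacher.generate(x) for x in train_inputs]

    # 2. Fine-tune the strong base model on those (noisy) demonstrations.
    student = finetune(strong_base, train_inputs, weak_demos)

    # 3. The ceiling: the same strong model fine-tuned on ground-truth labels.
    ceiling = finetune(strong_base, train_inputs, train_labels)

    # 4. Score everything and report the fraction of the gap recovered.
    weak_acc = evaluate(weak_teacher, eval_set)
    student_acc = evaluate(student, eval_set)
    ceiling_acc = evaluate(ceiling, eval_set)
    return (student_acc - weak_acc) / (ceiling_acc - weak_acc)
```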
🎯 Key findings
The results show that Claude can autonomously discover ways to improve the PGR. In just five days (800 cumulative hours of research), the Automated Alignment Researchers recovered nearly all of the remaining performance gap, reaching a final PGR of 0.97. This cost roughly $18,000 in tokens and model training, or about $22 per researcher-hour.
Key results
- Performance: PGR rose from the 23% human baseline to 97%
- Time efficiency: the work took five days, against the humans’ seven-day baseline
- Cost efficiency: roughly $22 per researcher-hour (see the quick check after this list)
- Scalability: the setup supports large-scale parallel research
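As a quick consistency check on the per-hour figure (the division is mine; only the two totals are quoted in the post):

```python
total_cost_usd = 18_000      # approximate spend on tokens and model training
researcher_hours = 800       # cumulative AAR research hours over the five days
print(total_cost_usd / researcher_hours)  # 22.5 -> consistent with the quoted ~$22/hour
```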
💡 Research significance
The study shows that Claude can significantly increase the rate of experimentation and exploration in alignment research. Human researchers can delegate problems to AARs at scale, and Claude can take on the work of developing new hypotheses and iterating on its own results.
Potential impact
1. Keeping pace: alignment research can keep up with the speed of AI capability gains
2. Democratizing research: a lower barrier to entry lets more people contribute to alignment work
3. Discovering new approaches: AI may find methods that human researchers had not considered
4. Greater efficiency: experimentation and validation become substantially faster
Full paper
Because of length limits, only the core content of the paper is shown here; see the original link for the full paper.