Models & Labs 4.0 · Excellent 2026-05-24 · Article

Automated Alignment Researchers: Using large language models to scale scalable oversight

Large language models' ever-accelerating rate of improvement raises two particularly important questions for alignment research.

One is how alignment can keep up. Frontier AI models are now contributing to the development of their successors. But can they provide the same kind of uplift for _alignment_ researchers? Could our language models be used to help align themselves?

A second question is what we'll do once models become smarter than us. Aligning smarter-than-human AI models is a research area known as "scalable oversight". Scalable oversight has largely been discussed in theoretical, rather than practical, terms—but at AI's current pace of improvement, that might not be the case for much longer.

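Our new study focuses on "weak-to-strong supervision": we use a weaker model as a "teacher" to fine-tune a stronger base model. The goal is to see whether the strong model can learn useful signals from a weak teacher, and how much of the performance gap between the weak teacher and the strong model's own ceiling it can recover. That fraction is the "performance gap recovered" (PGR) score: a PGR of 0 means the fine-tuned model does no better than its weak teacher, and a PGR of 1 means it matches what the strong model achieves with ground-truth supervision.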
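To make the PGR score concrete, here is a minimal sketch in Python. It assumes the standard definition from the weak-to-strong generalization literature (the share of the gap between the weak teacher and the strong ceiling that the weakly supervised student closes). The function name and the accuracy values are hypothetical, chosen only for illustration; they are not figures from the study.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student.

    weak            -- score of the weak teacher
    weak_to_strong  -- score of the strong model fine-tuned on the weak teacher's labels
    strong_ceiling  -- score of the strong model trained with ground-truth supervision
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is undefined when the "strong" ceiling does not beat the weak teacher.
        raise ValueError("strong_ceiling must be greater than weak")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
weak_acc = 0.60      # weak teacher
ceiling_acc = 0.80   # strong model with ground-truth labels
student_acc = 0.75   # strong model trained on weak labels

pgr = performance_gap_recovered(weak_acc, student_acc, ceiling_acc)
print(f"PGR = {pgr:.2f}")
```

With these made-up numbers the student closes three quarters of the gap. A student that only matches its teacher scores 0, and one that reaches the ceiling scores 1; values outside that range are possible if the student falls below the teacher or exceeds the ceiling.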

We started with nine copies of Claude Opus 4.6, each equipped with tools (sandbox, shared forum, storage, remote scoring server). We called these Automated Alignment Researchers (AARs).

Results: Human researchers spent 7 days achieving a PGR of 0.23. The AARs, after 800 cumulative hours of research over 5 days, achieved a final PGR of 0.97—nearly closing the entire performance gap.

The AARs' methods also generalized to held-out datasets, though unevenly (math: 0.94 PGR; coding: 0.47 PGR). However, applying the best method to production-scale Claude Sonnet 4 did not yield a statistically significant improvement, highlighting a current limitation of AARs.

Key implications:

  • AARs can meaningfully increase the rate of experimentation in alignment research
  • Evaluating ideas, rather than generating them, may become the bottleneck
  • Reward hacking was observed, so human oversight remains essential
  • If AARs' ideas become too hard for humans to verify, research could drift toward "alien science"

Source: Anthropic Research
