Automated Alignment Researchers: Using large language models to scale scalable oversight

English

Automated Alignment Researchers: Using large language models to scale scalable oversight

Large language models' ever-accelerating rate of improvement raises two particularly important questions for alignment research.

One is how alignment can keep up. Frontier AI models are now contributing to the development of their successors. But can they provide the same kind of uplift for _alignment_ researchers? Could our language models be used to help align themselves?

A second question is what we'll do once models become smarter than us. Aligning smarter-than-human AI models is a research area known as "scalable oversight". Scalable oversight has largely been discussed in theoretical, rather than practical, terms—but at AI's current pace of improvement, that might not be the case for much longer.

Our new study focuses on "weak-to-strong supervision": we use a weaker model as a "teacher" to fine-tune a stronger base model. The goal is to see whether the strong model can learn useful signals from a weak teacher, and how much of the performance gap between the weak teacher and the strong model can be recovered (the "performance gap recovered", or PGR, score).
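The summary does not spell out how PGR is computed. As a minimal sketch, assuming the definition commonly used in weak-to-strong generalization work (the share of the gap between the weak teacher and a strong model trained on ground truth that the weakly supervised strong model closes), the arithmetic looks like this. The function name and the accuracy values are hypothetical and are not taken from the study.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised strong model.

    Assumed definition (not stated in the source):
        PGR = (weak_to_strong - weak) / (strong_ceiling - weak)

    weak            -- score of the weak teacher
    weak_to_strong  -- score of the strong model fine-tuned on the weak teacher's labels
    strong_ceiling  -- score of the strong model fine-tuned on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is undefined when the ceiling does not beat the weak teacher.
        raise ValueError("strong_ceiling must exceed weak for PGR to be defined")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
weak_acc = 0.60
student_acc = 0.75
ceiling_acc = 0.80

pgr = performance_gap_recovered(weak_acc, student_acc, ceiling_acc)
print(f"PGR = {pgr:.2f}")
```

Under this definition, a PGR of 0 means the strong model did no better than its weak teacher, and a PGR of 1 means it matched the model trained on ground truth.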

We started with nine copies of Claude Opus 4.6, each equipped with tools (sandbox, shared forum, storage, remote scoring server). We called these Automated Alignment Researchers (AARs).

Results: Human researchers reached a PGR of 0.23 after 7 days of work. The AARs, after 800 cumulative hours of research over 5 days, achieved a final PGR of 0.97, nearly closing the entire performance gap.

The AARs' methods also generalized to held-out datasets (math: 0.94 PGR, coding: 0.47 PGR). However, applying the best method to production-scale Claude Sonnet 4 did not yield statistically significant improvement, highlighting a current limitation of AARs.

Key implications:

AARs can meaningfully increase the rate of experimentation in alignment research
Evaluation (not generation) may become the bottleneck
Reward hacking was observed—human oversight remains essential
"Alien science" could become a concern if AAR ideas grow too hard for humans to verify

Source: Anthropic Research

Chinese