AI Coding 4.0 · Excellent 2026-03-24 · Article

Harness design for long-running application development

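Anthropic's engineering team shares its experience designing harnesses for long-running application development. The article discusses how to design a test harness within an agent-driven development workflow so that frontend and full-stack applications maintain quality over long stretches of iteration. It covers automated testing strategies, CI/CD integration, and quality-assurance methodology for agentic coding.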

The following summary of "Harness design for long-running application development" is compiled from web search results.

An article titled "Harness design for long-running application development" by Anthropic was published on Medium.com on March 24, 2026. This article explores how multi-agent harness design significantly enhances the performance of AI models in complex, long-running tasks such as frontend design and autonomous software engineering.

Authored by Prithvi Rajasekaran from Anthropic's Labs team, the article details a shift from single-agent approaches to a GAN-inspired architecture involving specialized planner, generator, and evaluator roles. This architecture aims to overcome issues like "context anxiety" and poor self-assessment in AI models.

The methodology involves implementing objective grading criteria and automated testing using tools like Playwright, enabling the system to autonomously iterate on projects for extended periods to produce high-fidelity, functional applications. Comparative experiments have demonstrated that while these structured harnesses can increase token costs and latency, they deliver a level of creative polish and technical correctness that single models currently cannot achieve. The work suggests that as underlying models improve, the role of the AI engineer will increasingly involve refining these agentic orchestrations to expand the capabilities of autonomous systems.
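
The planner, generator, and evaluator roles described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the article's implementation: every name here (`Evaluation`, `run_harness`, the `demo_*` stubs, the `threshold` and `max_rounds` parameters) is hypothetical, and the stub evaluator stands in for whatever automated checks, such as browser tests, a real harness would run.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: none of these names come from the article.


@dataclass
class Evaluation:
    scores: dict[str, float]  # criterion name -> score in [0, 1]
    feedback: str             # notes handed back to the generator

    def passed(self, threshold: float) -> bool:
        # Every criterion must clear the bar, so a high score on one
        # criterion cannot hide a failing score on another.
        return all(score >= threshold for score in self.scores.values())


def run_harness(
    brief: str,
    plan: Callable[[str], list[str]],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Evaluation],
    threshold: float = 0.8,
    max_rounds: int = 5,
) -> dict[str, str]:
    """Plan once, then loop generate -> evaluate for each planned task."""
    results: dict[str, str] = {}
    for task in plan(brief):
        feedback = ""
        artifact = ""
        for _ in range(max_rounds):
            artifact = generate(task, feedback)
            verdict = evaluate(task, artifact)
            if verdict.passed(threshold):
                break
            feedback = verdict.feedback  # the judge's notes drive the next attempt
        results[task] = artifact
    return results


# Stub roles so the loop can run without any model calls.
def demo_plan(brief: str) -> list[str]:
    return [f"{brief}: layout", f"{brief}: styling"]


def demo_generate(task: str, feedback: str) -> str:
    return task + (" (revised)" if feedback else " (draft)")


def demo_evaluate(task: str, artifact: str) -> Evaluation:
    ok = artifact.endswith("(revised)")
    return Evaluation(
        scores={"functionality": 1.0 if ok else 0.4},
        feedback="" if ok else "fix failing checks",
    )


print(run_harness("app", demo_plan, demo_generate, demo_evaluate))
```

Keeping the evaluator as a separate callable mirrors the separation of the agent doing the work from the agent judging it. The `max_rounds` cap is an assumed safeguard that bounds the extra token cost and latency the summary mentions.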

An earlier version of Anthropic's long-running harness used an initializer agent, a coding agent that worked on one feature at a time, and context resets between sessions. The March 2026 article describes how the team extended that design by drawing inspiration from Generative Adversarial Networks (GANs), separating the agent doing the work from the agent judging it.
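
The earlier pattern of an initializer, one feature per session, and context resets might look roughly like the sketch below. It assumes that progress is carried across sessions in a small JSON file, which the summary does not state, and the function names (`initialize`, `run_session`, `run_until_done`) and the `implement` callback are hypothetical stand-ins for the agents.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical sketch: the progress file and all names are assumptions.


def initialize(state_path: Path, features: list[str]) -> None:
    """Initializer step: record the feature list once, all marked not done."""
    state = {"features": [{"name": name, "done": False} for name in features]}
    state_path.write_text(json.dumps(state, indent=2))


def run_session(state_path: Path, implement: Callable[[str], bool]) -> Optional[str]:
    """One coding session: start fresh, take the next unfinished feature, save progress."""
    # The file is the only memory carried over; nothing else survives the reset.
    state = json.loads(state_path.read_text())
    for feature in state["features"]:
        if not feature["done"]:
            feature["done"] = implement(feature["name"])
            state_path.write_text(json.dumps(state, indent=2))
            return feature["name"]
    return None  # nothing left to do


def run_until_done(
    state_path: Path,
    implement: Callable[[str], bool],
    max_sessions: int = 20,
) -> list[str]:
    """Run sessions back to back and return the features attempted, in order."""
    attempted: list[str] = []
    for _ in range(max_sessions):
        name = run_session(state_path, implement)
        if name is None:
            break
        attempted.append(name)
    return attempted


with tempfile.TemporaryDirectory() as tmp:
    progress = Path(tmp) / "progress.json"
    initialize(progress, ["login page", "search box"])
    print(run_until_done(progress, implement=lambda name: True))
```

Because each session reads its state from disk rather than from a growing conversation, a failed feature is simply retried in the next session, and `max_sessions` keeps that retry from running indefinitely.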

*This document was automatically scraped and translated by OpenClaw AI Field Notes.*

Related

Continue reading

Infra 4.0 · Excellent

Quantifying infrastructure noise in agentic coding evals

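Anthropic's engineering team quantified infrastructure noise in agentic coding evals. They found that even when the same agent eval is rerun in the same environment, results fluctuate significantly because of factors such as network latency, API load, and container scheduling. This challenges the reliability of evals such as SWE-Bench and Terminal-Bench. The article proposes methodological recommendations for reducing the noise.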

2026-03-11 · Article · Anthropic Engineering

Coding 4.0 · Excellent

Claude Opus 4.7: practical tips and workflows

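A summary of practical tips shared by Boris Cherny after extensive use of Claude Opus 4.7. Core features include Auto mode (Claude judges whether a command is safe and approves it automatically), /fewer-permission-prompts (a smart allowlist), Recaps (task summaries), Focus mode (hides intermediate steps), and flexible effort levels (low through max). Recommended workflow: have Claude verify its own work with end-to-end tests, combined with a custom /go skill that chains self-testing, code simplification, and PR submission. The post sparked a popular discussion with 211 likes and 41 reposts.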

2026-04-17 · X · dotey

Coding 4.0 · Excellent

Claude Code vs Codex: an in-depth comparison of two AI coding assistants

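An in-depth comparison based on real-world data from Reddit (Claude Code with Opus 4.6 for ~100 hours vs. Codex with GPT-5.4 for ~20 hours, 80,000 lines of Python/TypeScript, 2,800 test cases). It finds two distinct engineering personalities: Claude Code resembles a senior engineer racing a deadline, 3-4x faster but prone to piling up technical debt; Codex resembles a steady developer with 5-6 years of experience, more deliberate and delivering higher quality. The author proposes a practical complementary workflow: prototype and explore quickly with Claude Code, then use Codex to refactor the architecture and fill in tests. Core conclusion: AI coding assistants are amplifiers, not replacements; Claude needs a highly skilled driver, while Codex demands less real-time intervention.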

2026-04-17 · X · shao__meng

Coding 4.0 · Excellent

Anthropic's six-layer technology stack drives Claude Code's dominance

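An in-depth analysis of the six-layer architecture behind Claude Code: foundation model, open protocol (MCP), shared runtime, capability system, and extension framework. This vertical integration (Anthropic controls every layer from model training to end-user tooling) creates a compounding advantage that competitors find hard to replicate. Claude Code went from internal prototype to the most popular AI coding tool in under a year, shipping 176 updates in 2025 alone and producing 135,000 GitHub commits per day. Understanding how these six components relate architecturally is key to understanding Claude Code's lead.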

2026-04-10 · Article