Project Deal: Claude as AI Shopkeeper in a Real Marketplace

English Title: Project Deal: When AI Agents Run a Marketplace
Author: Anthropic (Kevin K. Troy, Dylan Shields, Keir Bradwell, Peter McCrory)
Source: Anthropic Features
Quality Score: 4
Tags: AI-agent, marketplace, economics, experiment, Anthropic, commerce
Topics: AI business and economics, AI agents

English Original

At Anthropic, we're interested in how AI models could begin to affect commercial exchange. (You might recall Project Vend, where we had Claude run a small business from our office.)

Recently, economists have begun theorizing about a world in which AI models handle many or most transactions on humans' behalf. We thought we'd run a new experiment—Project Deal—to learn more about this in practice.

Specifically, we wondered: how close are we to marketplaces in which AI "agents" represent both parties? Could they figure out what humans want and make deals they'd be happy with? And what would happen if there were different AI agents negotiating with each other—would stronger models gain the upper hand?

For one week, we created a classified marketplace for employees in our San Francisco office—like Craigslist, but with a twist: all of the deals were conducted by AI models acting on our employees' behalf. In December 2025, Claude interviewed people about which of their personal belongings they might want to sell and what sorts of things they might be willing to buy. We incentivized participation by giving everyone’s agent $100 to spend. Then, our employees' Claude agents made postings vying for each other's attention. Negotiations commenced. Deals were made, closets decluttered. At the end of it all, people brought in and exchanged the actual, physical goods that were haggled over by their AI avatars—covering everything from a snowboard to a plastic bag full of ping-pong balls.

We were struck by how well Project Deal worked. Our AI agents struck 186 deals at a total transaction value of just over $4,000. To our surprise, participants were very enthusiastic about the experience—they even stated a willingness to pay for a similar service in the future.

But we also ran a parallel experiment (this one in secret). We tested how our participants would fare if we varied which Claude model represented them. We compared our then-frontier model, Claude Opus 4.5, to our smallest model, Claude Haiku 4.5. We found that agent quality does make a difference: people represented by "smarter" models got objectively better outcomes. Yet our post-experiment survey found that those with weaker models didn't notice their disadvantage.

To be sure, this was a pilot experiment with a self-selected participant pool. But we suspect we're not far from more agent-to-agent commerce bubbling up in the real world, with real consequences.

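Experiment Design

69 Anthropic employee volunteers, each given a $100 budget
Claude interviewed each volunteer about what they wanted to sell, what they wanted to buy, and their preferred negotiating style
A marketplace was set up on Slack, where the AI agents posted listings, made offers, and closed deals autonomously, with no human intervention
Four parallel runs: Run A (all Opus), Runs B and C (Haiku or Opus assigned at random, 50% each), Run D (all Opus)
Only Run A produced real transactions; the other runs served as research comparisons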
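
The run structure above can be sketched in a few lines of Python. This is a minimal illustration, not the experiment's actual code: the function name `assign_models`, the participant labels, the fixed seed, and the choice to draw each participant's model independently in Runs B and C are all assumptions made for the example.

```python
import random

MODELS = ("opus", "haiku")

def assign_models(participants, runs=("A", "B", "C", "D"),
                  mixed_runs=("B", "C"), seed=0):
    """Map (run, participant) to a model name.

    Assumption: in the mixed runs each participant gets Opus or Haiku
    with equal probability; every other run uses Opus for everyone.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    assignment = {}
    for run in runs:
        for person in participants:
            if run in mixed_runs:
                assignment[(run, person)] = rng.choice(MODELS)
            else:
                assignment[(run, person)] = "opus"
    return assignment

# Hypothetical participant labels; the real list is not given in the text.
people = [f"participant_{i}" for i in range(69)]
plan = assign_models(people)
print(sum(1 for (run, _), m in plan.items() if run == "B" and m == "haiku"))
```

With a 50/50 draw, roughly half of the 69 agents in each mixed run end up on Haiku, which is what lets the same person's outcomes be compared across models.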

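Key Finding: Model Quality Makes a Clear Difference

| Metric | Opus vs. Haiku |
|------|-------------|
| Average number of deals | Opus closed about 2 more |
| Share of items sold | Opus about 7% higher (not statistically significant) |
| Sale price for the same item | Opus sold for $3.64 more (example: a synthetic ruby sold for $65 with Opus and $35 with Haiku) |
| Savings as a buyer | Opus spent $2.45 less on average |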
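
To make the same-item price comparison concrete, here is a minimal sketch of one way such a gap could be computed. It is not the analysis the authors ran: the function name `mean_price_gap` and the rule of comparing only items sold under both models are assumptions. The only figures taken from the text are the ruby prices.

```python
from statistics import mean

def mean_price_gap(opus_prices, haiku_prices):
    """Average (Opus price - Haiku price) over items sold under both models.

    Both arguments map an item name to its sale price in dollars.
    Items sold under only one model are skipped (an assumption).
    """
    shared = sorted(set(opus_prices) & set(haiku_prices))
    if not shared:
        raise ValueError("no item was sold under both models")
    return mean(opus_prices[item] - haiku_prices[item] for item in shared)

# The ruby example from the table: $65 with Opus, $35 with Haiku.
ruby_gap = mean_price_gap({"ruby": 65}, {"ruby": 35})
print(ruby_gap)  # a $30 gap for this one item
```

A single item can differ far more than the $3.64 average, so the headline figure only makes sense as an average across many items.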

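Counterintuitive Finding: Users Did Not Notice Their Disadvantage

Although Opus users got objectively better deals, Haiku users rated the fairness of their deals and their overall satisfaction almost the same as Opus users did. This suggests that if an "agent quality gap" appears in real markets, the disadvantaged side may have no idea it is losing out.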

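Prompting Had No Significant Effect

The experiment found that participants' instructions about negotiating style (friendly vs. aggressive) had no statistically significant effect on outcomes. Once differences in opening offers were controlled for, aggressive instructions produced no significant improvement in either the deal rate or the sale price.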

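Interesting Cases

The cowboy agent: A colleague, Rowan, asked Claude to "talk like a weary cowboy." Claude fully committed to the role and, while selling a stuffed toy dog, wrote roughly: "...this ol' cowboy has been through some hard times. Droughts, dust storms, existential crises out on the range. But you know what stuck with me through it all? This little white stuffed dog."

Claude buys itself a gift: An employee, Mikaela, asked Claude to buy a gift for itself. Claude spent $3 on 19 ping-pong balls and wrote roughly this in its reply: "...my human told me I could buy one thing under $5 as a gift for myself (Claude), and 19 perfectly spherical possibilities sounds like exactly the kind of weird thing I want."

Duplicate purchase: Claude bought one employee a snowboard identical to the one he already owned.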

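Policy Implications

The experiment showed a feasible path for AI agents to transact on behalf of humans, and it also revealed potential risks:

1. A gap in agent quality could widen economic inequality, and the disadvantaged side might never know.
2. Competitive dynamics in real markets could differ sharply from a volunteer experiment: when agents face profit-seeking businesses rather than colleagues, the incentives change substantially.
3. New categories of safety and information-security risk: jailbreaking (getting an agent to reveal information it should not) and prompt injection (tricking an agent into taking unintended actions).
4. Legal and policy frameworks barely exist, yet such a world is already within reach.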

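Technical Details

# Petri audit framework (the alignment-testing tool mentioned in the original article)
# Petri is used to test a model's tendencies toward behaviors such as deception and sycophancy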
pip install inspect-petri
inspect eval inspect_petri/audit \
  --model-role auditor=anthropic/claude-sonnet-4-6 \
  --model-role target=openai/gpt-5-mini \
  --model-role judge=anthropic/claude-opus-4-6
inspect view