The Collaboration Gap
November 4, 2025
Authors: Tim R. Davidson, Adam Fourney, Saleema Amershi, Robert West, Eric Horvitz, Ece Kamar
cs.AI
Abstract
The trajectory of AI development suggests that we will increasingly rely on
agent-based systems composed of independently developed agents with different
information, privileges, and tools. The success of these systems will
critically depend on effective collaboration among these heterogeneous agents,
even under partial observability. Despite intense interest, few empirical
studies have evaluated such agent-agent collaboration at scale. We propose a
collaborative maze-solving benchmark that (i) isolates collaborative
capabilities, (ii) modulates problem complexity, (iii) enables scalable
automated grading, and (iv) imposes no output-format constraints, preserving
ecological plausibility. Using this framework, we evaluate 32 leading open- and
closed-source models in solo, homogeneous, and heterogeneous pairings. Our
results reveal a "collaboration gap": models that perform well solo often
degrade substantially when required to collaborate. Collaboration can break
down dramatically; for instance, small distilled models that solve mazes well
alone may fail almost completely in certain pairings. We find that starting
with the stronger agent often improves outcomes, motivating a "relay inference"
approach where the stronger agent leads before handing off to the weaker one,
closing much of the gap. Our findings argue for (1) collaboration-aware
evaluation, (2) training strategies developed to enhance collaborative
capabilities, and (3) interaction design that reliably elicits agents' latent
skills; this guidance applies to both AI-AI and human-AI collaboration.