Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Abstract

The rapid growth in large language model (LLM) capabilities has spurred exploration of multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. In multi-agent debate (MAD), agents collaboratively present, critique, and refine arguments, which may offer better reasoning, robustness, and diversity of perspective than a single model. Although prior studies have used MAD, a systematic understanding of its effectiveness relative to single-agent methods, particularly under varying conditions, is still lacking. This paper addresses that gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration. We conduct a comprehensive empirical study comparing MAD with strong single-agent test-time scaling baselines on mathematical reasoning and safety-related tasks, systematically examining how task difficulty, model scale, and agent diversity affect MAD's performance. For mathematical reasoning, MAD offers limited advantages over single-agent scaling but becomes more effective as problem difficulty increases and model capability decreases, while agent diversity shows little benefit. For safety tasks, by contrast, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations gradually reduces the attack success rate over the course of collaborative refinement. We believe these findings provide critical guidance for developing more effective and strategically deployed MAD systems.