Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness
May 29, 2025
Authors: Yongjin Yang, Euiin Yi, Jongwoo Ko, Kimin Lee, Zhijing Jin, Se-Young Yun
cs.AI
Abstract
The remarkable growth in large language model (LLM) capabilities has spurred
exploration into multi-agent systems, with debate frameworks emerging as a
promising avenue for enhanced problem-solving. These multi-agent debate (MAD)
approaches, where agents collaboratively present, critique, and refine
arguments, potentially offer improved reasoning, robustness, and diverse
perspectives over monolithic models. Despite prior studies leveraging MAD, a
systematic understanding of its effectiveness compared to self-agent methods,
particularly under varying conditions, remains elusive. This paper seeks to
fill this gap by conceptualizing MAD as a test-time computational scaling
technique, distinguished by collaborative refinement and diverse exploration
capabilities. We conduct a comprehensive empirical investigation comparing MAD
with strong self-agent test-time scaling baselines on mathematical reasoning
and safety-related tasks. Our study systematically examines the influence of
task difficulty, model scale, and agent diversity on MAD's performance. Key
findings reveal that, for mathematical reasoning, MAD offers limited advantages
over self-agent scaling but becomes more effective with increased problem
difficulty and decreased model capability, while agent diversity shows little
benefit. Conversely, for safety tasks, MAD's collaborative refinement can
increase vulnerability, but incorporating diverse agent configurations
facilitates a gradual reduction in attack success through the collaborative
refinement process. We believe our findings provide critical guidance for the
future development of more effective and strategically deployed MAD systems.
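For readers unfamiliar with the debate setup the abstract refers to, below is a minimal illustrative sketch of a generic multi-agent debate loop (independent initial answers followed by rounds of collaborative refinement). It is not the paper's exact protocol; `query_llm` is a hypothetical placeholder for any chat-completion call, and the agent/round counts are arbitrary.

```python
# Minimal sketch of a generic multi-agent debate (MAD) loop.
# NOT the paper's implementation; query_llm is a hypothetical stand-in
# for a real LLM API call and simply echoes the prompt here.
from typing import Callable, List


def query_llm(prompt: str) -> str:
    # Hypothetical placeholder: replace with a real model call.
    return f"[model answer to: {prompt[:40]}...]"


def debate(question: str,
           num_agents: int = 3,
           num_rounds: int = 2,
           llm: Callable[[str], str] = query_llm) -> List[str]:
    """Each agent answers independently, then repeatedly refines its answer
    after reading the other agents' latest answers (collaborative refinement)."""
    # Round 0: independent proposals.
    answers = [llm(question) for _ in range(num_agents)]

    # Subsequent rounds: each agent critiques peers and refines its own answer.
    for _ in range(num_rounds):
        new_answers = []
        for i in range(num_agents):
            peers = "\n".join(a for j, a in enumerate(answers) if j != i)
            prompt = (f"Question: {question}\n"
                      f"Other agents' answers:\n{peers}\n"
                      f"Critique them and give your refined final answer.")
            new_answers.append(llm(prompt))
        answers = new_answers

    # Final answers can then be aggregated, e.g. by majority vote.
    return answers


if __name__ == "__main__":
    print(debate("What is 17 * 24?"))
```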