Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Abstract

The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, in which agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over single models. Although prior studies have leveraged MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD's performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self-agent scaling but becomes more effective with increased problem difficulty and decreased model capability, while agent diversity shows little benefit. Conversely, for safety tasks, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success rate through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
