Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Abstract

The remarkable growth in large language model (LLM) capabilities has spurred exploration of multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. In multi-agent debate (MAD), agents collaboratively present, critique, and refine arguments, which potentially offers improved reasoning, robustness, and more diverse perspectives than a monolithic model. Although prior studies have leveraged MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by its collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD's performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self-agent scaling but becomes more effective as problem difficulty increases and model capability decreases, while agent diversity shows little benefit. Conversely, for safety tasks, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success rate through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.
