EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

Abstract

We argue that multi-agent test-time evolution is not single-agent evolution replicated N times. A single-agent learner can evolve only its own context and memory. A multi-agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single-agent counterpart and can produce phenomena such as emergent specialization. Yet prior test-time methods either confine experience to individual agents, forfeiting cross-agent learning, or broadcast it symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training-free framework that instantiates test-time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post-task protocol triggered by team failure or disagreement. In it, agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team-level operators assemble niche-conditioned teams and select collaboration structures online. Population-level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3-8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi-domain reasoning, outperforming the best baseline on math by 32% in relative terms. Ablations confirm asymmetric cross-agent transfer as the primary driver. Starting from several identically initialized agents, four to five stable niche specialists emerge spontaneously, a structural signature of multi-agent evolution that no single-agent learner can express. Code is available at https://github.com/Mercury7353/EvoChamber
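The asymmetric routing step described in the abstract can be illustrated with a minimal sketch. It is not taken from the EvoChamber codebase: the `Agent` record, the per-niche `skill` scores, the two thresholds, and the `route_insight` function are all hypothetical names and assumptions chosen for illustration. The abstract does not say how strong and weak agents are identified, so fixed thresholds on an assumed success rate stand in for that decision here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Agent:
    """Hypothetical agent record (an assumption, not the paper's data model)."""
    name: str
    # Assumed per-niche success rate in [0, 1].
    skill: Dict[str, float] = field(default_factory=dict)
    # Insights received so far, grouped by niche.
    memory: Dict[str, List[str]] = field(default_factory=dict)


def route_insight(agents, niche, insight,
                  strong_threshold=0.7, weak_threshold=0.4):
    """Deliver `insight` only to agents that are weak on `niche`.

    Returns the names of the receiving agents. As a further assumption,
    nothing is routed when no agent is strong on the niche, because there
    is then no strong source to transfer from.
    """
    strong = [a for a in agents if a.skill.get(niche, 0.0) >= strong_threshold]
    if not strong:
        return []
    weak = [a for a in agents if a.skill.get(niche, 0.0) <= weak_threshold]
    for agent in weak:
        agent.memory.setdefault(niche, []).append(insight)
    return [a.name for a in weak]


if __name__ == "__main__":
    pool = [
        Agent("a", {"geometry": 0.9, "code": 0.2}),
        Agent("b", {"geometry": 0.3, "code": 0.8}),
        Agent("c", {"geometry": 0.5, "code": 0.5}),
    ]
    # Only "b" is weak on geometry, so only "b" receives the insight.
    print(route_insight(pool, "geometry", "Check the triangle inequality first."))
```

In contrast to a symmetric broadcast, agents that are already strong or mid-level on the niche keep their memory unchanged, which is one simple way to read the abstract's claim that routing fills knowledge gaps without erasing specialization.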

