Heterogeneous Agent Collaborative Reinforcement Learning

Abstract

We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to improve one another, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
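The rollout-sharing idea can be illustrated with a small sketch. The abstract does not specify HACPO's four mechanisms, so nothing below reproduces them: the `Rollout` record, the function names, and the group-relative (GRPO-style) reward normalization are all assumptions chosen for illustration. The sketch only shows verified rollouts from several agents being pooled per prompt, scored against a shared baseline, and marked as on- or off-policy for a given learner.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    # Hypothetical record type; field names are illustrative only.
    agent: str      # name of the agent that generated the response
    prompt: str
    response: str
    reward: float   # verifier score, e.g. 1.0 if correct, 0.0 otherwise


def pooled_advantages(rollouts, eps=1e-6):
    """Normalize rewards over all rollouts pooled for a single prompt.

    Assumption: a GRPO-style (reward - mean) / std baseline. The actual
    HACPO estimator is not given in the abstract.
    """
    rewards = [r.reward for r in rollouts]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r.reward - mu) / (sigma + eps) for r in rollouts]


def training_batch(rollouts, learner):
    """Pair each pooled rollout with its advantage for one learner.

    The third tuple element flags rollouts produced by a different agent;
    those are off-policy for `learner` and would need some correction for
    distribution shift, which this sketch deliberately leaves out.
    """
    advantages = pooled_advantages(rollouts)
    return [(r, a, r.agent != learner) for r, a in zip(rollouts, advantages)]


# Two hypothetical agents answer the same prompt; both train on all four rollouts.
pool = [
    Rollout("agent_a", "2+2?", "4", 1.0),
    Rollout("agent_a", "2+2?", "5", 0.0),
    Rollout("agent_b", "2+2?", "4", 1.0),
    Rollout("agent_b", "2+2?", "3", 0.0),
]
batch_for_a = training_batch(pool, learner="agent_a")
```

In this toy setup each agent generates two rollouts but receives four training samples, which is the sense in which sharing can raise sample utilization per unit of rollout cost. Each agent still runs on its own at inference time, since only the training data is shared.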