Heterogeneous Agent Collaborative Reinforcement Learning
March 3, 2026
Authors: Zhixia Zhang, Zixuan Huang, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban
cs.AI
Abstract
We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL realizes collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to improve one another, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL requires no coordinated deployment, and unlike on-/off-policy distillation, it supports bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm with principled rollout sharing that maximizes sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
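To make the training-side mechanism concrete, the sketch below illustrates the rollout-sharing idea under simplifying assumptions; it is not the paper's HACPO implementation. Two toy categorical policies stand in for heterogeneous LLM agents, verified rollouts are broadcast to a shared pool, and each agent scores the pooled trajectories with a clipped sequence-level importance ratio to correct for the off-policy shift introduced by peer-generated data. All names here (ToyPolicy, verify, collaborative_objective) and the binary verifier are hypothetical, and the paper's four tailored mechanisms are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyPolicy:
    """Stand-in for a heterogeneous LLM agent: a categorical policy over tokens."""
    def __init__(self, logits):
        self.logits = np.asarray(logits, dtype=float)

    def probs(self):
        z = np.exp(self.logits - self.logits.max())
        return z / z.sum()

    def sample_rollout(self, length=4):
        p = self.probs()
        tokens = rng.choice(len(p), size=length, p=p)
        return tokens, np.log(p[tokens]).sum()   # trajectory + behavior log-prob

    def logprob(self, tokens):
        return np.log(self.probs()[tokens]).sum()

def verify(tokens):
    """Toy verifier (hypothetical): a rollout is 'correct' iff its last token is 0."""
    return 1.0 if tokens[-1] == 0 else 0.0

agents = [ToyPolicy([0.5, 0.0, -0.5]), ToyPolicy([-0.3, 0.8, 0.1])]

# Rollout phase: each agent keeps all of its own rollouts and broadcasts
# only the verified ones to a shared pool ("share verified rollouts").
local, shared = [[] for _ in agents], []
for i, agent in enumerate(agents):
    for _ in range(8):
        tokens, logp = agent.sample_rollout()
        r = verify(tokens)
        local[i].append((tokens, logp, r))
        if r > 0:
            shared.append((tokens, logp, r, i))

def collaborative_objective(agent_id, eps=0.2):
    """Surrogate objective for one agent over its own + peers' verified rollouts.

    Peer rollouts are off-policy for this agent, so each trajectory is
    re-weighted by a sequence-level importance ratio
    pi_self(traj) / pi_behavior(traj), clipped as a crude off-policy guard
    (a simplification of PPO/GSPO-style corrections).
    """
    batch = local[agent_id] + [(t, lp, r) for t, lp, r, src in shared
                               if src != agent_id]
    rewards = np.array([r for _, _, r in batch])
    adv = rewards - rewards.mean()               # group-mean baseline
    obj = 0.0
    for (tokens, behavior_logp, _), a in zip(batch, adv):
        # Own rollouts have ratio 1 before any update; peer rollouts do not.
        ratio = np.exp(agents[agent_id].logprob(tokens) - behavior_logp)
        obj += np.clip(ratio, 1 - eps, 1 + eps) * a
    return obj / len(batch)

for i in range(len(agents)):
    print(f"agent {i}: collaborative surrogate objective = "
          f"{collaborative_objective(i):+.3f}")
```

The group-mean baseline mirrors GRPO/GSPO-family advantage estimation; the clipped importance ratio is one standard way to keep peer-generated, off-policy trajectories from biasing the update, whereas HACPO's actual guarantees rest on its four dedicated mechanisms.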