Heterogeneous Agent Collaborative Reinforcement Learning
March 3, 2026
Authors: Zhixia Zhang, Zixuan Huang, Xin Xia, Deqing Wang, Fuzhen Zhuang, Shuai Ma, Ning Ding, Yaodong Yang, Jianxin Li, Yikun Ban
cs.AI
Abstract
We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
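The rollout-sharing idea can be illustrated with a minimal toy sketch. This is not the paper's HACPO implementation; `pooled_advantages` and its arguments are hypothetical names, and the sketch only shows one plausible mechanism the abstract hints at: pooling a peer agent's verified rollouts with one's own, normalizing advantages within the pooled group (GRPO-style), and correcting borrowed samples for the policy gap with an importance ratio.

```python
import math

def pooled_advantages(own_rewards, shared):
    """Toy cross-agent advantage pooling (illustrative only).

    own_rewards: rewards of rollouts sampled from this agent's own policy.
    shared: list of (reward, logp_self, logp_behavior) tuples for rollouts
            generated by a peer agent; logp_self is the log-probability of
            the rollout under this agent's policy, logp_behavior under the
            peer's (behavior) policy.
    """
    # Pool rewards from both sources and compute group-relative advantages.
    rewards = list(own_rewards) + [r for r, _, _ in shared]
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n) or 1.0
    adv = [(r - mean) / std for r in rewards]
    # Own rollouts are on-policy (weight 1); borrowed rollouts get an
    # importance ratio to account for the distribution shift between agents.
    weights = [1.0] * len(own_rewards) + [
        math.exp(lp_self - lp_behavior) for _, lp_self, lp_behavior in shared
    ]
    return [w * a for w, a in zip(weights, adv)]
```

In a real algorithm the importance ratios would typically be clipped or filtered (the abstract's "four tailored mechanisms" presumably address exactly such capability gaps and distribution shifts), but the sketch shows why a correction term is needed at all: without it, a peer's rollouts would bias the gradient estimate.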