Co-Evolving Policy Distillation

April 29, 2026
Authors: Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang
cs.AI

Abstract

RLVR (reinforcement learning with verifiable rewards) and OPD (on-policy distillation) have become standard paradigms for post-training. We provide a unified analysis of these two paradigms for consolidating multiple expert capabilities into a single model and identify distinct modes of capability loss: mixed RLVR suffers an inter-capability divergence cost, while the pipeline of first training experts and then performing OPD avoids divergence but fails to fully absorb teacher capabilities because of the large behavioral-pattern gap between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and introduces OPD during each expert's ongoing RLVR training rather than after expert training completes, with experts serving as mutual teachers (making OPD bidirectional) so that they co-evolve. This yields more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout training. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The model-parallel training pattern offered by CoPD may inspire a novel training-scaling paradigm.
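
As a concrete illustration of the training scheme the abstract describes, here is a minimal sketch of one CoPD-style update, assuming a PyTorch setup in which each expert is an ordinary module mapping inputs to token logits. All names here (distill_loss, rlvr_loss, copd_step, the mixing weight alpha, and the simplified advantage-weighted stub standing in for the real RLVR objective) are hypothetical illustrations, not the authors' implementation: each expert optimizes its own domain's RLVR loss plus a distillation term toward every peer's current policy on shared inputs, so OPD runs bidirectionally during RLVR rather than as a separate stage afterwards.

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits):
    # Forward KL(teacher || student) over the vocabulary; the teacher side is
    # detached, so gradients flow only into the student expert.
    log_student = F.log_softmax(student_logits, dim=-1)
    log_teacher = F.log_softmax(teacher_logits.detach(), dim=-1)
    return F.kl_div(log_student, log_teacher,
                    log_target=True, reduction="batchmean")

def rlvr_loss(expert, batch):
    # Hypothetical advantage-weighted policy-gradient stub standing in for the
    # paper's RLVR objective (verifier rewards are out of scope for this sketch).
    logits = expert(batch["inputs"])
    log_probs = F.log_softmax(logits, dim=-1)
    taken = log_probs.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)
    return -(batch["advantages"] * taken).mean()

def copd_step(experts, optimizers, domain_batches, shared_inputs, alpha=0.1):
    # One parallel update: every expert keeps its own domain RLVR objective
    # and additionally distills from each peer's *current* policy on shared
    # inputs, i.e. bidirectional OPD interleaved with RLVR, not run after it.
    shared_logits = [expert(shared_inputs) for expert in experts]
    losses = []
    for i, expert in enumerate(experts):
        loss = rlvr_loss(expert, domain_batches[i])
        for j, peer_logits in enumerate(shared_logits):
            if j != i:
                loss = loss + alpha * distill_loss(shared_logits[i], peer_logits)
        losses.append(loss)
    for opt in optimizers:
        opt.zero_grad()
    sum(losses).backward()  # teachers are detached, so gradients stay per-expert
    for opt in optimizers:
        opt.step()
```

Because the teacher side is detached at every step, each expert chases its peers' current behavioral patterns without surrendering its own domain objective, which is one plausible reading of how CoPD keeps patterns consistent across experts while preserving complementary knowledge.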