Co-Evolving Policy Distillation

April 29, 2026
Authors: Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and On-Policy Distillation (OPD) have become standard paradigms for post-training. We provide a unified analysis of these two paradigms for consolidating multiple expert capabilities into a single model and identify distinct causes of capability loss: mixed RLVR suffers from an inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though it avoids divergence, fails to fully absorb teacher capabilities because of the large behavioral-pattern gap between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and introduces OPD during each expert's ongoing RLVR training rather than after expert training completes, with experts serving as mutual teachers (making OPD bidirectional) so that they co-evolve. This keeps behavioral patterns consistent across experts while preserving sufficient complementary knowledge throughout training. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The parallel model-training pattern offered by CoPD may inspire a new training-scaling paradigm.
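
To make the interleaving concrete, below is a minimal sketch of a CoPD-style loop under stated assumptions: the toy `Expert` policy, the `verifiable_reward` stub, REINFORCE as the RLVR update, reverse KL as the distillation objective, and all hyperparameters (including the 0.5 mixing weight) are illustrative assumptions, not the paper's implementation. The sketch only shows the structural idea: each expert takes an RLVR step on its own domain while simultaneously distilling from its co-evolving peers, every iteration.

```python
# Hypothetical sketch of a CoPD-style loop; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, N_EXPERTS, STEPS = 32, 64, 3, 100

class Expert(nn.Module):
    """Toy policy: maps a prompt embedding to logits over a token vocabulary."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(VOCAB, HIDDEN), nn.Tanh(),
                                 nn.Linear(HIDDEN, VOCAB))
    def forward(self, x):
        return self.net(x)  # logits

def verifiable_reward(actions, domain):
    # Stand-in for a domain-specific verifier (e.g., answer checking).
    return (actions % N_EXPERTS == domain).float()

experts = [Expert() for _ in range(N_EXPERTS)]
opts = [torch.optim.Adam(e.parameters(), lr=1e-3) for e in experts]

for step in range(STEPS):
    prompts = torch.randn(16, VOCAB)  # shared batch of prompts
    for i, (expert, opt) in enumerate(zip(experts, opts)):
        logits = expert(prompts)
        dist = torch.distributions.Categorical(logits=logits)
        actions = dist.sample()

        # 1) RLVR step on the expert's own domain: REINFORCE against a
        #    verifiable reward, with the batch mean as a baseline.
        reward = verifiable_reward(actions, domain=i)
        rlvr_loss = -(dist.log_prob(actions) * (reward - reward.mean())).mean()

        # 2) OPD during ongoing training: distill from the other experts,
        #    which are mid-training rather than fully trained. Each peer
        #    serves as teacher here; in the peer's own turn the roles
        #    reverse, making the distillation bidirectional.
        opd_loss = 0.0
        for j, teacher in enumerate(experts):
            if j == i:
                continue
            with torch.no_grad():
                teacher_logp = F.log_softmax(teacher(prompts), dim=-1)
            student_logp = F.log_softmax(logits, dim=-1)
            # Reverse KL, KL(student || teacher): evaluated on the
            # student's own distribution (the on-policy flavor).
            opd_loss = opd_loss + F.kl_div(teacher_logp, student_logp,
                                           log_target=True,
                                           reduction="batchmean")
        opd_loss = opd_loss / (N_EXPERTS - 1)

        loss = rlvr_loss + 0.5 * opd_loss  # 0.5: assumed mixing weight
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The structural point the sketch illustrates is that distillation happens inside the training loop, against peers that are themselves still improving, rather than as a separate post-hoc stage after each expert has finished training.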