Co-Evolving Policy Distillation

Abstract

Reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD) have become standard paradigms for post-training. We provide a unified analysis of how these two paradigms consolidate multiple expert capabilities into a single model, and show that each loses capability in a different way. Mixed RLVR suffers from an inter-capability divergence cost. The pipeline that first trains experts and then performs OPD avoids this divergence, but fails to fully absorb the teachers' capabilities because of the large gap in behavioral patterns between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and introduces OPD during each expert's ongoing RLVR training rather than after expert training is complete. The experts serve as teachers for one another, making OPD bidirectional, so that they co-evolve. This keeps behavioral patterns more consistent across experts while preserving sufficient complementary knowledge throughout training. Experiments show that CoPD integrates text, image, and video reasoning capabilities into a single all-in-one model, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The pattern of training models in parallel that CoPD offers may inspire a novel training scaling paradigm.
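The co-evolution scheme described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes that OPD can be modeled as minimizing the reverse KL divergence from a student to its teacher under the student's own distribution, that each expert is a categorical policy over three actions, and that the peer is treated as a fixed teacher within each step. The reward vectors `REWARD_A` and `REWARD_B`, the weight `beta`, and every function name below are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical per-action verifiable rewards for two domains (toy values).
REWARD_A = [1.0, 0.5, 0.0]
REWARD_B = [0.0, 0.5, 1.0]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Expected reward of the categorical policy defined by the logits."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): an expectation under the student's own policy."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def reward_grad(logits, rewards):
    """Exact gradient of the expected reward with respect to the logits."""
    p = softmax(logits)
    baseline = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pi * (ri - baseline) for pi, ri in zip(p, rewards)]


def reverse_kl_grad(student_logits, teacher_logits):
    """Exact gradient of KL(student || teacher) w.r.t. the student logits."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    log_ratio = [math.log(pi / qi) for pi, qi in zip(p, q)]
    kl = sum(pi * lr for pi, lr in zip(p, log_ratio))
    return [pi * (lr - kl) for pi, lr in zip(p, log_ratio)]


def co_evolve(beta, steps=200, lr=0.5):
    """Train two experts in parallel; each one distills from the other.

    beta = 0 reduces to two independent reward-only runs.
    """
    a = [0.0, 0.0, 0.0]
    b = [0.0, 0.0, 0.0]
    for _ in range(steps):
        # Both gradients use the current snapshot of the peer as teacher.
        grad_a = [g - beta * k for g, k in
                  zip(reward_grad(a, REWARD_A), reverse_kl_grad(a, b))]
        grad_b = [g - beta * k for g, k in
                  zip(reward_grad(b, REWARD_B), reverse_kl_grad(b, a))]
        a = [x + lr * g for x, g in zip(a, grad_a)]
        b = [x + lr * g for x, g in zip(b, grad_b)]
    return a, b


if __name__ == "__main__":
    for beta in (0.0, 1.0):
        a, b = co_evolve(beta)
        print(beta, reverse_kl(a, b), expected_reward(a, REWARD_A))
```

In this toy setting, running with `beta=0.0` lets the two experts drift apart, while a positive `beta` keeps their distributions closer as each still improves on its own reward. It is only meant to make the idea of a behavioral gap concrete, and says nothing about the actual models or losses used in the paper.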