Co-Evolving Policy Distillation

Abstract

RLVR and OPD have become standard paradigms for post-training. We present a unified analysis of the two paradigms for consolidating multiple expert capabilities into a single model and identify a distinct form of capability loss in each. Mixed RLVR incurs an inter-capability divergence cost. The pipeline that first trains experts and then performs OPD avoids this divergence but fails to fully absorb the teachers' capabilities, because the behavioral patterns of teacher and student differ too much. We propose Co-Evolving Policy Distillation (CoPD), which encourages training experts in parallel, introduces OPD during each expert's ongoing RLVR training rather than after expert training is complete, and has the experts serve as mutual teachers (making OPD bidirectional) so that they co-evolve. This keeps behavioral patterns more consistent across experts while preserving sufficient complementary knowledge throughout training. Experiments show that CoPD integrates text, image, and video reasoning capabilities into a single all-in-one model, significantly outperforming strong baselines such as mixed RLVR and MOPD and even surpassing domain-specific experts. The parallel-model training pattern that CoPD offers may inspire a new paradigm for scaling training.
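To make the co-evolving idea concrete, here is a minimal toy sketch; it is not the authors' implementation. It assumes two experts, each a small softmax policy over a handful of actions in two domains. Each expert runs an RLVR-style update on its own domain, with a reward of 1 for the correct action, and in the same step distills from its peer on the peer's domain by descending the reverse KL divergence. All names (`co_evolve`, `opd_weight`, `targets`), the choice of reverse KL as the distillation loss, and the use of exact expected gradients in place of sampled on-policy rollouts are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rlvr_grad(logits, target):
    """Exact gradient of the expected verifiable reward w.r.t. the logits.

    The reward is 1 when the chosen action equals `target`, else 0
    (a toy stand-in for a verifier).
    """
    p = softmax(logits)
    expected_reward = p[target]
    return [p[j] * ((1.0 if j == target else 0.0) - expected_reward)
            for j in range(len(p))]


def reverse_kl_grad(student_logits, teacher_logits):
    """Gradient of KL(student || teacher) w.r.t. the student logits."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    log_ratio = [math.log(pi) - math.log(qi) for pi, qi in zip(p, q)]
    kl = sum(pi * r for pi, r in zip(p, log_ratio))
    return [pi * (r - kl) for pi, r in zip(p, log_ratio)]


def co_evolve(steps=2000, lr=0.5, opd_weight=1.0, num_actions=4, targets=(1, 3)):
    """Train two toy experts in parallel with bidirectional distillation.

    experts[e][d] holds the logits of expert e on domain d.
    Expert e is the specialist (and hence the teacher) for domain e.
    """
    experts = [[[0.0] * num_actions for _ in targets] for _ in range(2)]
    for _ in range(steps):
        # Snapshot so both experts read each other's pre-update policy.
        snapshot = [[list(logits) for logits in expert] for expert in experts]
        for e in (0, 1):
            peer = 1 - e
            # RLVR-style step: ascend the expected reward on the expert's own domain.
            g = rlvr_grad(snapshot[e][e], targets[e])
            experts[e][e] = [z + lr * gi for z, gi in zip(snapshot[e][e], g)]
            # Distillation step: descend the reverse KL to the peer on the
            # peer's domain, while the peer itself is still training.
            g = reverse_kl_grad(snapshot[e][peer], snapshot[peer][peer])
            experts[e][peer] = [z - lr * opd_weight * gi
                                for z, gi in zip(snapshot[e][peer], g)]
    return experts


def accuracy(expert, targets):
    """Probability each domain's correct action receives under one expert."""
    return [softmax(expert[d])[targets[d]] for d in range(len(targets))]


if __name__ == "__main__":
    for name, weight in (("with distillation", 1.0), ("without distillation", 0.0)):
        trained = co_evolve(opd_weight=weight)
        print(name, [[round(a, 3) for a in accuracy(x, (1, 3))] for x in trained])
```

In this toy setting, setting `opd_weight=0.0` leaves each expert at chance on the domain it does not train on, whereas with distillation enabled both experts become accurate on both domains. The sketch only illustrates the data flow of mutual teaching during training; it does not reproduce the divergence or behavioral-gap effects analyzed in the paper.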