Distilling Long Chain-of-Thought Reasoning via Collaborative Step-wise Multi-Teacher Decoding

Abstract

Distilling large reasoning models (LRMs) is essential for making long chain-of-thought (Long-CoT) reasoning practical, because full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post hoc; they overlook collaboration among heterogeneous teachers and lack dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD also generalizes well to out-of-domain and open-ended settings. The dataset and model are available at https://github.com/DISL-Lab/CoRD.
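
The abstract names the ingredients (several teachers, step-wise synthesis, perplexity-based scoring, beam search) but not the exact procedure. The following is a minimal sketch of how such a loop could look, not the CoRD implementation. The `Teacher` and `Scorer` interfaces, the function names, and the `<end>` stop marker are all hypothetical, and plain perplexity stands in for the paper's predictive perplexity-based score, whose definition is not given here.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions for illustration, not CoRD's API):
# a teacher proposes the next reasoning step given the steps so far;
# a scorer returns per-token log-probabilities for a candidate step.
Teacher = Callable[[Sequence[str]], str]
Scorer = Callable[[Sequence[str], str], List[float]]
Beam = Tuple[float, List[str]]  # (cumulative log-perplexity, steps)


def perplexity(logprobs: List[float]) -> float:
    """exp of the negative mean token log-probability (non-empty input)."""
    return math.exp(-sum(logprobs) / len(logprobs))


def stepwise_beam_search(
    teachers: Sequence[Teacher],
    scorer: Scorer,
    beam_width: int,
    max_steps: int,
    stop: str = "<end>",
) -> List[Beam]:
    """Return up to beam_width trajectories, lowest cumulative cost first."""
    beams: List[Beam] = [(0.0, [])]
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for cost, steps in beams:
            if steps and steps[-1] == stop:
                candidates.append((cost, steps))  # finished beam carries over
                continue
            # Every teacher may extend every live beam, so one trajectory
            # can mix steps written by different teachers.
            for teacher in teachers:
                step = teacher(steps)
                ppl = perplexity(scorer(steps, step))
                candidates.append((cost + math.log(ppl), steps + [step]))
        # Keep the lowest-cost distinct trajectories.
        beams, seen = [], set()
        for cost, steps in sorted(candidates, key=lambda c: c[0]):
            if tuple(steps) not in seen:
                seen.add(tuple(steps))
                beams.append((cost, steps))
            if len(beams) == beam_width:
                break
        if all(s and s[-1] == stop for _, s in beams):
            break
    return beams
```

In practice each teacher and the scorer would wrap a language model call. Summing log-perplexity across steps is one simple way to rank partial trajectories; other aggregation rules are equally plausible.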