Distilling Long-CoT Reasoning via Collaborative Step-wise Multi-Teacher Decoding

Abstract

Distilling large reasoning models (LRMs) is essential for making Long-CoT reasoning practical, as full-scale inference remains computationally prohibitive. Existing curation-based approaches select complete reasoning traces post-hoc, overlooking collaboration among heterogeneous teachers and lacking dynamic exploration, which leads to redundant sampling and missed complementary reasoning. We introduce CoRD, a collaborative multi-teacher decoding framework that performs step-wise reasoning synthesis guided by predictive perplexity-based scoring and beam search. This enables heterogeneous LRMs to jointly construct coherent reasoning trajectories while efficiently preserving diverse, high-potential hypotheses. Experiments show that CoRD produces higher-quality reasoning data and achieves near teacher-level student performance with fewer, structured supervision signals, without substantial efficiency overhead. CoRD further generalizes well to out-of-domain and open-ended settings. The dataset and model are available at https://github.com/DISL-Lab/CoRD.
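
As a rough illustration of the step-wise idea described in the abstract, the following minimal sketch runs a beam search in which several teachers each propose the next reasoning step and every candidate is scored by perplexity. It is not the CoRD implementation: the teacher interface (`propose`, `token_logprobs`), the scoring rule (mean log-perplexity of a step across all teachers, lower is better), the stopping test `is_final`, and the toy teachers are all assumptions made for this example, since the abstract does not specify these details.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical teacher interface (an assumption, not taken from the paper):
#   propose(prefix)              -> one candidate next reasoning step
#   token_logprobs(prefix, step) -> per-token log-probabilities of that step
Propose = Callable[[List[str]], str]
TokenLogprobs = Callable[[List[str], str], Sequence[float]]
Teacher = Tuple[Propose, TokenLogprobs]


def perplexity(logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))


def collaborative_beam_search(
    teachers: List[Teacher],
    beam_width: int,
    max_steps: int,
    is_final: Callable[[str], bool],
) -> List[Tuple[List[str], float]]:
    """Return up to `beam_width` (steps, cost) pairs, lowest cost first."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[List[str], float]] = []
        seen = set()
        for steps, cost in beams:
            if steps and is_final(steps[-1]):
                candidates.append((steps, cost))  # carry finished beams over
                continue
            for propose, _ in teachers:
                new_steps = steps + [propose(steps)]
                key = tuple(new_steps)
                if key in seen:  # two teachers proposed the same step
                    continue
                seen.add(key)
                # Assumed scoring rule: mean log-perplexity over all teachers.
                ppls = [perplexity(lp(steps, new_steps[-1])) for _, lp in teachers]
                step_cost = sum(math.log(p) for p in ppls) / len(ppls)
                candidates.append((new_steps, cost + step_cost))
        candidates.sort(key=lambda c: c[1])
        beams = candidates[:beam_width]
        if all(s and is_final(s[-1]) for s, _ in beams):
            break
    return beams


def make_toy_teacher(name: str, own_logprob: float) -> Teacher:
    """Toy stand-in for an LRM; real teachers would call a model."""
    def propose(prefix: List[str]) -> str:
        return "Answer: 4" if prefix else f"{name}: 2 + 2 = 4"

    def token_logprobs(prefix: List[str], step: str) -> Sequence[float]:
        familiar = step.startswith(name) or step.startswith("Answer")
        return [own_logprob if familiar else -1.0] * len(step.split())

    return propose, token_logprobs


teachers = [make_toy_teacher("A", -0.2), make_toy_teacher("B", -0.5)]
beams = collaborative_beam_search(
    teachers, beam_width=2, max_steps=4,
    is_final=lambda step: step.startswith("Answer"),
)
best_steps, best_cost = beams[0]
```

Because every step is scored by all teachers rather than only by its author, a surviving trajectory can interleave steps from different teachers, which is the collaborative behavior the abstract describes. Costs here are not length-normalized, so a real system would likely need to handle trajectories of differing lengths more carefully.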