Filter, then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Abstract

On-policy distillation (OPD) for large language models is shifting from full-trajectory KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both the trajectory and token levels. Specifically, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, the soft-weighting mechanism of FiRe-OPD mitigates information loss and improves optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods (e.g., +6.25 on AIME 2024 in the strong-to-weak setting and +18.81 on Miner in the multi-teacher setting). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
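To make the two-stage idea concrete, the following is a minimal sketch of a filter-then-reweight loss. It is not the FiRe-OPD implementation. The abstract does not specify the filtering criterion or the weighting function, so everything here is an assumption made for illustration: the per-trajectory quality score, the threshold `min_score`, the softmax-over-KL weights, the `temperature` parameter, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical input type: (quality_score, per_token_kl) for one rollout.
# Both fields are illustrative assumptions, not quantities defined in the abstract.
Trajectory = Tuple[float, Sequence[float]]


def filter_trajectories(trajs: Sequence[Trajectory], min_score: float) -> List[Trajectory]:
    """Stage 1 (assumed criterion): keep non-empty rollouts scoring at least min_score."""
    return [t for t in trajs if t[0] >= min_score and len(t[1]) > 0]


def soft_token_weights(token_kl: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Stage 2 (assumed form): softmax over per-token KL, rescaled to have mean 1.

    Every token keeps a non-zero weight, in contrast to hard token selection,
    which would set the weight of unselected tokens to exactly zero.
    """
    scaled = [k / temperature for k in token_kl]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    n = len(token_kl)
    return [n * e / z for e in exps]


def filter_then_reweight_loss(
    trajs: Sequence[Trajectory], min_score: float, temperature: float = 1.0
) -> float:
    """Weighted mean token KL per kept trajectory, averaged over kept trajectories."""
    kept = filter_trajectories(trajs, min_score)
    if not kept:
        return 0.0
    total = 0.0
    for _, token_kl in kept:
        weights = soft_token_weights(token_kl, temperature)
        total += sum(w * k for w, k in zip(weights, token_kl)) / len(token_kl)
    return total / len(kept)
```

In this sketch a larger `temperature` flattens the weights toward uniform, which recovers an unweighted mean of the token KL values over the kept trajectories, while a smaller value concentrates the loss on high-divergence tokens.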