Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

Abstract

On-policy distillation (OPD) for large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink the optimization granularity of OPD and propose FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at the trajectory and token levels. Specifically, FiRe-OPD first filters trajectories to remove low-quality rollouts and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, soft weighting mitigates information loss and improves optimization stability, yielding finer-grained OPD optimization. We validate FiRe-OPD in strong-to-weak, single-teacher, and multi-teacher settings and show that it outperforms recent token-level OPD methods (e.g., +6.25 on AIME 2024 in the strong-to-weak setting and +18.81 on Miner in the multi-teacher setting). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.
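The two-stage idea of filtering trajectories and then softly reweighting tokens can be illustrated with a minimal sketch. The abstract does not specify the filtering criterion or the weighting function, so everything concrete below is an assumption made for illustration only: the per-trajectory quality score, the `keep_threshold` cutoff, the softmax-over-token-KL weighting, and the names `fire_weights` and `weighted_loss` are hypothetical and are not taken from the FiRe-OPD code.

```python
import math

def fire_weights(token_kl, traj_scores, keep_threshold=0.5, temperature=1.0):
    """Hypothetical filter-then-reweight step (not the authors' implementation).

    token_kl:    list of trajectories, each a list of per-token teacher/student KL values
    traj_scores: one assumed quality score per trajectory (higher is better)
    Returns per-token weights with the same shape as token_kl.
    """
    weights = []
    for kls, score in zip(token_kl, traj_scores):
        # Stage 1 (assumed criterion): drop trajectories below the quality threshold.
        if score < keep_threshold or not kls:
            weights.append([0.0] * len(kls))
            continue
        # Stage 2 (assumed weighting): softmax over token KL, rescaled to mean 1.
        # Every retained token keeps a positive weight, unlike hard token selection.
        peak = max(kls)
        exps = [math.exp((k - peak) / temperature) for k in kls]
        total = sum(exps)
        weights.append([len(kls) * e / total for e in exps])
    return weights

def weighted_loss(token_kl, weights):
    """Weighted average of per-token KL over all tokens with non-zero weight."""
    num = sum(w * k for ws, ks in zip(weights, token_kl) for w, k in zip(ws, ks))
    den = sum(w for ws in weights for w in ws)
    return num / den if den > 0 else 0.0

# Toy example: the second trajectory falls below the threshold and is filtered out.
token_kl = [[0.1, 0.9, 0.2], [0.5, 0.5]]
traj_scores = [0.8, 0.2]
weights = fire_weights(token_kl, traj_scores)
loss = weighted_loss(token_kl, weights)
```

In this sketch, a filtered trajectory contributes nothing to the loss, while tokens in a retained trajectory are only down-weighted rather than discarded. This is the contrast with hard token selection that the abstract draws.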