Trust-Region Behavior Blending for On-Policy Distillation

Abstract

On-policy distillation (OPD) trains a student model on prefixes sampled from its own policy so that it matches a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, so teacher supervision is spent on weak or low-quality prefixes. We propose Trust-Region Behavior Blending (TRB), a warmup method that replaces the early rollout policy with the behavior policy closest to the teacher inside a student-centered KL trust region, while leaving the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two mathematical-reasoning distillation settings, TRB attains the highest average performance among the compared methods.
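
To make the trust-region idea concrete, the following is a minimal single-token sketch, not the paper's implementation. It rests on several assumptions that the abstract does not state: that "closest to teacher" means minimizing KL(q || teacher), that the trust region is KL(q || student) <= eps, and that the budget follows a linear schedule. Under those assumptions the minimizer is a geometric mixture of student and teacher, and the mixing weight can be found by bisection because KL(q || student) grows monotonically with that weight. All function names (`trust_region_behavior`, `kl_budget`, and so on) are hypothetical, and both distributions are assumed to have full support.

```python
import math

def normalize_log(logits):
    """Return log-probabilities (log-softmax) for a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl(log_q, log_p):
    """KL(q || p) for two full-support distributions given as log-probs."""
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))

def geometric_mix(log_s, log_t, lam):
    """q proportional to student^(1 - lam) * teacher^lam."""
    return normalize_log([(1.0 - lam) * ls + lam * lt
                          for ls, lt in zip(log_s, log_t)])

def trust_region_behavior(log_s, log_t, eps, iters=50):
    """Assumed form: argmin_q KL(q || teacher) s.t. KL(q || student) <= eps."""
    if eps <= 0.0:
        return list(log_s)          # zero budget: pure student rollouts
    if kl(log_t, log_s) <= eps:
        return list(log_t)          # teacher already lies inside the region
    lo, hi = 0.0, 1.0               # invariant: the mix at `lo` is feasible
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(log_s, log_t, mid), log_s) <= eps:
            lo = mid
        else:
            hi = mid
    return geometric_mix(log_s, log_t, lo)

def kl_budget(step, warmup_steps, eps0):
    """Hypothetical linear anneal of the KL budget to zero over warmup."""
    return eps0 * max(0.0, 1.0 - step / warmup_steps)

# Toy example: student and teacher disagree on a 3-token vocabulary.
log_s = normalize_log([2.0, 0.5, -1.0])
log_t = normalize_log([-1.0, 0.5, 2.0])
log_q = trust_region_behavior(log_s, log_t, kl_budget(0, 100, 0.1))
```

In this sketch only the sampling distribution changes: tokens would be drawn from `log_q` during warmup, while the training loss at each prefix would remain the reverse KL, `kl(log_s, log_t)`. Once `kl_budget` reaches zero, `trust_region_behavior` returns the student distribution itself, which corresponds to the return to pure student rollouts described above.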