Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective

Abstract

We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun to integrate task-specific rewards into distillation, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state-augmented reinforcement learning to the distillation setting, introducing a modified reward function that maintains theoretical guarantees of constraint satisfaction without requiring state augmentation or teacher model access during deployment, and without the computational overhead of dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves better constraint satisfaction rates and better reasoning than soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings.
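
As a rough illustration of the formulation summarized in the abstract, the constrained objective can be sketched as below. The notation is assumed here for illustration and is not taken from the paper, which may define these quantities differently: student policy $\pi_\theta$, teacher policy $\pi_T$, task reward $r$, divergence measure $D$, threshold $\epsilon$, and prompt distribution $\mathcal{D}$.

```latex
% Illustrative sketch only: all symbols are assumed notation, not the paper's.
\begin{aligned}
\max_{\theta}\quad
  & \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[\, r(x, y) \,\big] \\
\text{s.t.}\quad
  & \mathbb{E}_{x \sim \mathcal{D}}
    \big[\, D\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x)\big) \,\big]
    \le \epsilon
\end{aligned}

% For contrast, a generic Lagrangian relaxation folds the constraint into the
% objective with a multiplier \lambda \ge 0 (again, assumed notation):
\max_{\theta}\;
  \mathbb{E}\big[\, r(x, y) \,\big]
  - \lambda \Big( \mathbb{E}\big[\, D(\pi_\theta \,\|\, \pi_T) \,\big] - \epsilon \Big)
```

In the first form the divergence budget $\epsilon$ is a hard requirement, whereas in the relaxed form the trade-off depends on the choice of $\lambda$, which is the kind of reward weighting the abstract describes as ad hoc.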
