
Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective

September 26, 2025
Authors: Matthieu Zimmer, Xiaotong Ji, Tu Nguyen, Haitham Bou Ammar
cs.AI

Abstract

We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun exploring the integration of task-specific rewards into distillation processes, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state-augmented reinforcement learning to the distillation setting, introducing a modified reward function that maintains theoretical guarantees of constraint satisfaction without requiring state augmentation or teacher model access during deployment, and without the computational overhead of dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves higher constraint satisfaction rates and better reasoning than soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings.
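
As a minimal sketch of the formulation described above (all notation here is assumed, not taken from the paper): let $\pi_\theta$ denote the student policy, $\pi_T$ the frozen teacher, $r$ a task-specific reward, and $\varepsilon$ the allowed divergence budget. The constrained distillation objective can then be written as

$$
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D}} \big[ D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_T(\cdot \mid x) \big) \big] \le \varepsilon,
$$

whereas a soft Lagrangian relaxation baseline would instead maximize the unconstrained surrogate $r(x, y) - \lambda \, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T)$ for a fixed or hand-tuned weight $\lambda$, i.e., the kind of ad-hoc reward weighting the abstract argues against.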