Rethinking Large Language Model Distillation: A Constrained Markov Decision Process Perspective

Abstract

We introduce a novel approach to large language model (LLM) distillation by formulating it as a constrained reinforcement learning problem. While recent work has begun to integrate task-specific rewards into the distillation process, existing methods typically rely on ad-hoc reward weighting. We propose a principled optimization framework that maximizes task-specific rewards while constraining the divergence from the teacher model to remain below a specified threshold. Our approach adapts constrained state-augmented reinforcement learning to the distillation setting and introduces a modified reward function that retains theoretical guarantees of constraint satisfaction. It requires neither state augmentation nor teacher model access during deployment, and it avoids the computational overhead of dual Lagrangian methods. Through extensive experiments on mathematical reasoning tasks, we demonstrate that our method achieves higher constraint satisfaction rates and better reasoning than soft Lagrangian relaxation baselines while maintaining competitive task performance. Our framework provides a theoretically grounded and practically efficient solution for reward-aware distillation in resource-constrained settings.
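
As a rough illustration of the formulation described in the abstract, the constrained problem can be sketched as follows. The notation is assumed here for illustration and does not come from the abstract: \pi_\theta is the student policy, \pi_T the teacher, r the task-specific reward, D a divergence measure (for example a KL divergence, with the direction left unspecified), \beta the divergence threshold, and \lambda a hand-tuned weight. The second expression shows the kind of fixed-weight objective that the abstract calls ad-hoc reward weighting.

```latex
% Constrained view (sketch): maximize task reward subject to a divergence budget.
\begin{align}
  \max_{\theta} \quad
    & \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
      \big[\, r(x, y) \,\big] \\
  \text{s.t.} \quad
    & \mathbb{E}_{x \sim \mathcal{D}}
      \big[\, D\big(\pi_\theta(\cdot \mid x),\, \pi_T(\cdot \mid x)\big) \,\big]
      \;\le\; \beta
\end{align}

% Fixed-weight alternative (sketch): the divergence enters as a penalty
% with a manually chosen coefficient \lambda.
\begin{equation}
  \max_{\theta} \quad
    \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[\, r(x, y) \,\big]
    \;-\; \lambda\,
    \mathbb{E}_{x \sim \mathcal{D}}
    \big[\, D\big(\pi_\theta(\cdot \mid x),\, \pi_T(\cdot \mid x)\big) \,\big]
\end{equation}
```

In the constrained view the threshold \beta is stated directly, whereas in the penalty view the divergence actually reached depends on how \lambda is tuned. The specific modified reward function that the abstract mentions is not given there and is not reproduced in this sketch.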
