Trust Region On-Policy Distillation

Abstract

**On-Policy Distillation (OPD)** is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens can yield unreliable policy gradients and even cause optimization to fail. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes *Trust Region On-Policy Distillation*, **TrOPD**. It has three components: 1) **Trust-Region On-Policy Learning:** TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) **Outlier Estimation:** For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) **Off-Policy Guidance:** The student continues generation from teacher prefixes and uses forward KL to imitate the off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
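
To make the quantities named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It computes the per-token K1 estimate of reverse KL (the student's log-probability minus the teacher's log-probability on a token sampled by the student), an exact forward KL over a full vocabulary distribution, and a masked average over "trusted" tokens. The abstract does not say how the trust region is defined, so the rule used here (absolute log-ratio below a threshold `max_gap`) and all function names are assumptions chosen purely for illustration.

```python
import math


def k1_reverse_kl(student_logp, teacher_logp):
    """Per-token K1 estimate of KL(student || teacher).

    Each entry is log pi_student(y_t) - log pi_teacher(y_t) for a token y_t
    that was sampled from the student.
    """
    return [s - t for s, t in zip(student_logp, teacher_logp)]


def forward_kl(teacher_probs, student_probs):
    """Exact KL(teacher || student) over one full vocabulary distribution."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def trust_region_mask(student_logp, teacher_logp, max_gap=2.0):
    """ASSUMPTION: a token is 'trusted' when |log-ratio| <= max_gap.

    This rule is hypothetical; the source does not specify the criterion.
    """
    return [abs(s - t) <= max_gap for s, t in zip(student_logp, teacher_logp)]


def masked_k1_mean(student_logp, teacher_logp, max_gap=2.0):
    """Average the K1 estimate over trusted tokens only (masking variant)."""
    k1 = k1_reverse_kl(student_logp, teacher_logp)
    mask = trust_region_mask(student_logp, teacher_logp, max_gap)
    kept = [k for k, m in zip(k1, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy log-probabilities of three student-sampled tokens (made-up numbers).
student_logp = [-0.5, -1.0, -0.2]
teacher_logp = [-0.7, -1.5, -6.0]  # the teacher finds the third token very unlikely

mask = trust_region_mask(student_logp, teacher_logp)
masked = masked_k1_mean(student_logp, teacher_logp)
fkl = forward_kl([0.9, 0.1], [0.5, 0.5])
```

In this toy input the third token has a log-ratio of 5.8, so the assumed rule masks it out and the average is taken over the first two tokens only. The sketch evaluates the estimates themselves; a training loop would instead differentiate a loss built from them, and the outlier tokens could be handled by clipping or forward KL rather than masking, as the abstract lists.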