On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Abstract

Knowledge distillation (KD) is widely used to compress a teacher model, reducing its inference cost and memory footprint by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from a distribution mismatch between the output sequences seen during training and those the student generates at inference time. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of relying solely on a fixed set of output sequences, GKD trains the student on its own self-generated output sequences, using the teacher's feedback on those sequences. Unlike supervised KD approaches, GKD also offers the flexibility to use alternative loss functions between the student and the teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, as well as for task-agnostic distillation for instruction tuning.
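The on-policy idea in the abstract can be illustrated with a toy sketch: sample a sequence from the student, then score each position by a divergence between the teacher's and the student's next-token distributions on that student-generated prefix. Everything below is an illustrative assumption rather than the paper's implementation: the three-token vocabulary, the `toy_teacher` and `toy_student` functions, and the three candidate divergences (forward KL, reverse KL, and a mixture-based Jensen-Shannon form) are hypothetical stand-ins for the "alternative loss functions" the abstract mentions. Only the loss is computed; no gradient update is shown.

```python
import math
import random


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def forward_kl(teacher, student):
    # Conventional supervised-KD direction: KL(teacher || student).
    return kl(teacher, student)


def reverse_kl(teacher, student):
    # Alternative direction: KL(student || teacher).
    return kl(student, teacher)


def jsd(teacher, student, beta=0.5):
    # Mixture-based Jensen-Shannon form (an assumed choice, not taken from the text).
    m = [beta * t + (1 - beta) * s for t, s in zip(teacher, student)]
    return beta * kl(teacher, m) + (1 - beta) * kl(student, m)


def toy_teacher(prefix):
    # Hypothetical teacher: after token 0 it strongly prefers token 1.
    if prefix and prefix[-1] == 0:
        return [0.1, 0.8, 0.1]
    return [0.7, 0.2, 0.1]


def toy_student(prefix):
    # Hypothetical student: ignores the prefix entirely.
    return [0.4, 0.4, 0.2]


def sample_sequence(policy, length, rng):
    """Sample tokens one at a time from `policy`, conditioning on the prefix."""
    prefix = ()
    for _ in range(length):
        probs = policy(prefix)
        token = rng.choices(range(len(probs)), weights=probs)[0]
        prefix += (token,)
    return prefix


def on_policy_loss(teacher, student, divergence, length, rng):
    """Mean per-token divergence along a sequence sampled from the student."""
    seq = sample_sequence(student, length, rng)
    per_token = [
        divergence(teacher(seq[:t]), student(seq[:t])) for t in range(length)
    ]
    return sum(per_token) / length


if __name__ == "__main__":
    rng = random.Random(0)
    for name, div in [("forward KL", forward_kl), ("reverse KL", reverse_kl), ("JSD", jsd)]:
        print(name, on_policy_loss(toy_teacher, toy_student, div, 8, rng))
```

The key difference from training on a fixed dataset is in `on_policy_loss`: the prefixes come from `sample_sequence(student, ...)`, so the teacher's feedback is collected on the states the student actually visits at inference time.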
