On Teacher Hacking in Language Model Distillation

Abstract

Post-training of language models (LMs) increasingly relies on two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model. This phenomenon is in line with Goodhart's law and can lead to degraded performance on the true objective. In this paper, we investigate whether a similar phenomenon, which we call teacher hacking, can occur during knowledge distillation. It could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When a fixed offline dataset is used for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust and efficient LMs.
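The oracle, teacher, and student setup can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not the paper's method: the abstract does not say which metrics are used, so the choice of KL divergence, the names `proxy` and `golden`, the helper `is_teacher_hacking`, and all the probabilities are hypothetical. It treats each model as a single distribution over a three-token vocabulary and flags teacher hacking when the student moves closer to the teacher while moving away from the oracle.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 3-token vocabulary.
oracle = [0.5, 0.3, 0.2]      # stands in for the ground-truth distribution
teacher = [0.6, 0.25, 0.15]   # imperfect approximation of the oracle

# Two hypothetical student checkpoints: an earlier one that sits between
# oracle and teacher, and a later one that matches the teacher exactly.
student_early = [0.55, 0.275, 0.175]
student_late = [0.6, 0.25, 0.15]


def proxy(student):
    """Distance to the teacher: what distillation optimizes (assumed metric)."""
    return kl_divergence(teacher, student)


def golden(student):
    """Distance to the oracle: what we actually care about (assumed metric)."""
    return kl_divergence(oracle, student)


def is_teacher_hacking(prev_student, curr_student):
    """True when the proxy metric improves while the golden metric degrades."""
    proxy_improves = proxy(curr_student) < proxy(prev_student)
    golden_degrades = golden(curr_student) > golden(prev_student)
    return proxy_improves and golden_degrades


hacking = is_teacher_hacking(student_early, student_late)
print("teacher hacking between checkpoints:", hacking)
```

In this toy case the later checkpoint reproduces the teacher perfectly, so its proxy distance is zero, yet it is further from the oracle than the earlier checkpoint. Only a controlled setup with a known oracle makes the second quantity measurable.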
