The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Abstract

The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that SFT on merely 1K examples can also achieve significant alignment performance, suggesting that the effect of alignment tuning might be "superficial." This raises the question of how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions, and that most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and our results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
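
To make the tuning-free setup concrete, the following is a minimal sketch of how a prompt made of a system prompt plus a few constant stylistic examples might be assembled for a base LLM. It is not the authors' implementation: the names `Example` and `build_icl_prompt` and the `# Query:` / `# Answer:` markers are hypothetical choices made for illustration, and the actual URIAL prompt format may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One constant stylistic demonstration (hypothetical structure)."""
    instruction: str
    response: str


def build_icl_prompt(system_prompt: str, examples: list[Example], query: str) -> str:
    """Concatenate a system prompt, fixed demonstrations, and the new query.

    The "# Query:" / "# Answer:" markers are an assumed format, not one
    taken from the abstract.
    """
    parts = [system_prompt.strip(), ""]
    for ex in examples:
        parts.append(f"# Query:\n{ex.instruction.strip()}\n")
        parts.append(f"# Answer:\n{ex.response.strip()}\n")
    # Leave the final answer slot open so the base model continues from here.
    parts.append(f"# Query:\n{query.strip()}\n")
    parts.append("# Answer:\n")
    return "\n".join(parts)
```

The same examples are reused for every query, so only the final query slot changes between calls; the resulting string would be passed as-is to a base model for plain text completion.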
