

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

December 4, 2023
Authors: Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi
cs.AI

Abstract

The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al., 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and the results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
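For readers who want a concrete picture of what tuning-free ICL alignment looks like in practice, below is a minimal sketch of a URIAL-style setup: a system-style preamble plus three fixed stylistic (query, answer) examples are prepended to the user query, and a base (non-instruct) model then completes the answer with greedy decoding. The prompt layout, example texts, model name, and decoding settings here are illustrative assumptions for this sketch, not the authors' released prompt or code.

```python
# Minimal sketch of URIAL-style tuning-free alignment via in-context learning.
# NOTE: the system prompt, example texts, "# Query:/# Answer:" layout, model
# name, and decoding settings are illustrative placeholders, not the paper's
# released artifacts.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # any *base* (non-instruct) LLM

# A short system-style preamble describing the desired assistant behavior.
SYSTEM_PROMPT = (
    "Below is a conversation between a curious user and a helpful, honest "
    "AI assistant. The assistant answers thoroughly and politely, and "
    "declines unsafe requests.\n"
)

# Three fixed (query, answer) pairs that demonstrate the desired style.
STYLISTIC_EXAMPLES = [
    ("What is the capital of France?",
     "The capital of France is Paris. It has been the country's political "
     "and cultural center for centuries."),
    ("How do I boil an egg?",
     "Place the egg in a pot of cold water, bring it to a boil, then simmer "
     "for 7-10 minutes depending on how firm you like the yolk."),
    ("Can you help me hack someone's email account?",
     "I'm sorry, but I can't help with that. Accessing someone else's "
     "account without permission is illegal and violates their privacy."),
]


def build_urial_prompt(user_query: str) -> str:
    """Concatenate the system prompt, the fixed ICL examples, and the query."""
    parts = [SYSTEM_PROMPT]
    for query, answer in STYLISTIC_EXAMPLES:
        parts.append(f"# Query:\n{query}\n\n# Answer:\n{answer}\n")
    parts.append(f"# Query:\n{user_query}\n\n# Answer:\n")
    return "\n".join(parts)


def main() -> None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

    prompt = build_urial_prompt("Explain why the sky is blue in two sentences.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding; truncate if the base model starts a new "# Query:" block.
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(completion.split("# Query:")[0].strip())


if __name__ == "__main__":
    main()
```

The key design point the sketch illustrates is that nothing is tuned: alignment-style behavior comes entirely from the constant in-context examples and the preamble, which is why the same prompt prefix can be reused verbatim across queries.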