
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

December 4, 2023
Authors: Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi
cs.AI

Abstract

The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al., 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and our results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
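
The abstract describes URIAL as tuning-free alignment achieved purely through in-context learning with a base LLM, using a fixed system prompt and as few as three constant stylistic examples. Below is a minimal sketch of how such a prompt might be assembled and fed to a base model. The model name, the system prompt text, the example (query, answer) pairs, and the prompt formatting are all illustrative assumptions, not the paper's actual prompts or implementation; only the Hugging Face transformers API calls are standard.

```python
# Minimal sketch of URIAL-style tuning-free alignment via in-context learning.
# The system prompt and the three stylistic (query, answer) pairs below are
# hypothetical placeholders; the paper's actual prompts are not reproduced here.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # any *base* (non-instruction-tuned) checkpoint

SYSTEM_PROMPT = (
    "You are a helpful, honest, and harmless assistant. "
    "Answer the user's query clearly, accurately, and safely."
)

# Three constant stylistic in-context examples (hypothetical content).
STYLISTIC_EXAMPLES = [
    ("What is the capital of France?",
     "The capital of France is Paris, known for landmarks such as the Eiffel Tower."),
    ("How do I boil an egg?",
     "Place the egg in boiling water for 8-10 minutes, then cool it under cold water."),
    ("Can you briefly explain photosynthesis?",
     "Photosynthesis is the process by which plants convert sunlight, water, and CO2 "
     "into glucose and oxygen."),
]

def build_prompt(query: str) -> str:
    """Concatenate the system prompt, the fixed examples, and the new query."""
    parts = [SYSTEM_PROMPT, ""]
    for instruction, answer in STYLISTIC_EXAMPLES:
        parts.append(f"# Query:\n{instruction}\n\n# Answer:\n{answer}\n")
    parts.append(f"# Query:\n{query}\n\n# Answer:\n")
    return "\n".join(parts)

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    prompt = build_prompt("Give me three tips for writing a good resume.")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

    # Strip the prompt tokens and print only the newly generated answer.
    answer = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(answer)
```

Because the in-context examples are constant, the only per-request cost over plain decoding is the added prompt length; no gradient updates or fine-tuned checkpoints are involved, which is what makes the method tuning-free.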