The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Abstract

The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that SFT on as few as 1K examples can also achieve significant alignment performance, suggesting that the effect of alignment tuning might be "superficial." This raises the question of how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions, and that most distribution shifts occur on stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce URIAL, a simple, tuning-free alignment method. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and our results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
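To make the token distribution shift analysis concrete, the following is a minimal sketch of one way such a comparison could be run; it is not taken from the abstract. It assumes that, at each position, the base model is given the same prefix as the aligned model, and that we then check where the aligned model's chosen token ranks under the base model's next-token distribution. The function names, the category labels, and the rank threshold (`marginal_max_rank=3`) are illustrative assumptions, and the distributions below are toy numbers rather than real model outputs.

```python
from typing import Dict, List, Sequence


def token_rank(probs: Sequence[float], token_id: int) -> int:
    """Return the 1-based rank of token_id under probs (1 = most likely)."""
    target = probs[token_id]
    return 1 + sum(1 for p in probs if p > target)


def classify_positions(
    base_dists: Sequence[Sequence[float]],
    aligned_tokens: Sequence[int],
    marginal_max_rank: int = 3,  # assumed threshold, not stated in the abstract
) -> List[str]:
    """Label each position by how the base model ranks the aligned model's token."""
    labels = []
    for probs, tok in zip(base_dists, aligned_tokens):
        rank = token_rank(probs, tok)
        if rank == 1:
            labels.append("unshifted")  # base model would pick the same token
        elif rank <= marginal_max_rank:
            labels.append("marginal")   # token is still among the base model's top choices
        else:
            labels.append("shifted")    # base model considers the token unlikely
    return labels


def shift_summary(labels: Sequence[str]) -> Dict[str, float]:
    """Return the fraction of positions in each category."""
    n = len(labels)
    keys = ("unshifted", "marginal", "shifted")
    if n == 0:
        return {k: 0.0 for k in keys}
    return {k: labels.count(k) / n for k in keys}


# Toy example: 4 positions over a 5-token vocabulary (hypothetical numbers).
base_dists = [
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.10, 0.50, 0.30, 0.05, 0.05],
    [0.40, 0.30, 0.15, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.60],
]
aligned_tokens = [0, 2, 4, 4]  # tokens chosen by the aligned model at each position

labels = classify_positions(base_dists, aligned_tokens)
summary = shift_summary(labels)
```

With real models, `base_dists` would come from the base LLM's softmaxed logits at each position of the aligned model's response. A high "unshifted" fraction would be consistent with the claim that the two models decode nearly identically at most positions.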

