The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Abstract

The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can also achieve significant alignment performance, suggesting that the effect of alignment tuning might be "superficial." This raises the question of how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and our results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
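To make the tuning-free setup concrete, the following is a minimal sketch of how a prompt built from a system prompt plus a small, constant set of stylistic examples might be assembled for a base LLM. It is an illustration under stated assumptions, not the authors' implementation: the `Example` container, the `build_prompt` helper, the `# Query:` / `# Answer:` markers, and the placeholder texts are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One fixed in-context demonstration (hypothetical container)."""
    query: str
    answer: str


def build_prompt(system_prompt: str, examples: list[Example], user_query: str) -> str:
    """Concatenate a system prompt, constant examples, and a new query.

    The "# Query:" / "# Answer:" layout is an assumed format chosen for
    illustration; the abstract does not specify one.
    """
    parts = [system_prompt.strip(), ""]
    for ex in examples:
        parts += ["# Query:", ex.query.strip(), "", "# Answer:", ex.answer.strip(), ""]
    # End on an open answer slot so a base LLM continues with the response.
    parts += ["# Query:", user_query.strip(), "", "# Answer:", ""]
    return "\n".join(parts)


# Placeholder texts (assumptions): the same three examples are reused for every query.
SYSTEM_PROMPT = "Below are conversations between a user and a helpful assistant."
EXAMPLES = [
    Example("Placeholder question 1", "Placeholder well-structured answer 1"),
    Example("Placeholder question 2", "Placeholder well-structured answer 2"),
    Example("Placeholder question 3", "Placeholder well-structured answer 3"),
]

prompt = build_prompt(SYSTEM_PROMPT, EXAMPLES, "What is in-context learning?")
print(prompt)
```

Because the examples are constant, the prefix is identical for every query and only the final query changes; the resulting string would be passed unchanged to a base LLM's text-completion interface, with no SFT or RLHF step involved.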