Sample-Efficient Alignment for LLMs

Abstract

We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem within the framework of contextual dueling bandits. This formulation subsumes recent paradigms such as online RLHF and online DPO, and it inherently calls for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with the oracle's preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.
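
To make the dueling-bandit view concrete, the following is a minimal toy sketch of Thompson sampling for dueling bandits. It is not the SEA implementation: it drops the context (the prompt), treats a fixed set of arms as candidate responses, keeps Beta posteriors over pairwise win rates, and assumes a simulated Bradley-Terry oracle. The names `sample_best`, `run_duels`, and `true_scores` are hypothetical and chosen for illustration only.

```python
import math
import random


def sample_best(wins, rng, exclude=None):
    """Draw one posterior sample of all pairwise win rates and return the
    arm whose sampled win rates against the other arms sum highest."""
    k = len(wins)
    best_arm, best_score = None, -1.0
    for i in range(k):
        if i == exclude:
            continue
        score = 0.0
        for j in range(k):
            if j != i:
                # Beta(1 + wins of i over j, 1 + wins of j over i) posterior.
                score += rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
        if score > best_score:
            best_arm, best_score = i, score
    return best_arm


def run_duels(true_scores, rounds, seed=0):
    """Run `rounds` duels against a simulated Bradley-Terry oracle
    (an assumption of this sketch) and return the pairwise win counts."""
    rng = random.Random(seed)
    k = len(true_scores)  # requires k >= 2
    wins = [[0] * k for _ in range(k)]
    for _ in range(rounds):
        # Two independent posterior samples pick the two arms to compare.
        a = sample_best(wins, rng)
        b = sample_best(wins, rng, exclude=a)
        # Oracle prefers a over b with Bradley-Terry probability.
        p_a = 1.0 / (1.0 + math.exp(-(true_scores[a] - true_scores[b])))
        if rng.random() < p_a:
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins


if __name__ == "__main__":
    counts = run_duels([0.0, 1.0, 2.0], rounds=200)
    for row in counts:
        print(row)
```

Each round spends exactly one oracle query, so the number of rounds plays the role of the feedback budget. Sampling from the posterior, rather than always picking the current best estimate, is what provides the active exploration.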
