Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding

Abstract

Large Language Models (LLMs) must be aligned with human preferences to avoid generating offensive, false, or meaningless content. Low-resource methods for LLM alignment have recently attracted attention, but they still struggle to produce content that is both high-quality and aligned. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), which enhances the alignment ability of base models through the guidance of a small aligned model. The small model first drafts a well-aligned beginning, and the large base model then continues the rest, with the handover controlled by a purpose-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, and use it to fine-tune a small model, Pilot-3B, as the draft model. Under the WSD framework, Pilot-3B enhances different base models so that they outperform all baseline methods while avoiding degradation on downstream tasks, known as the alignment tax. We further conduct extensive experiments to examine the impact of different settings and time efficiency, and analyze the intrinsic mechanisms of WSD in depth.
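The draft-then-continue procedure described above can be illustrated with a minimal sketch. The abstract does not specify how the auto-switch mechanism decides when to hand over, so the rule used here is an assumption made purely for illustration: switch once the base model's geometric-mean probability over the last `window` drafted tokens reaches `threshold`. The function name, the callable interfaces (`draft_step`, `base_prob`, `base_step`), and the toy models are all hypothetical stand-ins, not the authors' implementation or any library's API.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence marker


def weak_to_strong_decode(
    prompt: Sequence[str],
    draft_step: Callable[[Sequence[str]], str],        # small aligned model: context -> next token
    base_prob: Callable[[Sequence[str], str], float],  # base model: P(token | context)
    base_step: Callable[[Sequence[str]], str],         # base model: context -> next token
    window: int = 3,
    threshold: float = 0.5,
    max_tokens: int = 32,
) -> Tuple[List[str], Optional[int]]:
    """Return (generated tokens, number of drafted tokens at the switch or None)."""
    tokens = list(prompt)
    log_probs: List[float] = []
    switch_at: Optional[int] = None

    # Phase 1: the small model drafts the beginning while the base model scores it.
    while len(tokens) - len(prompt) < max_tokens:
        tok = draft_step(tokens)
        log_probs.append(math.log(max(base_prob(tokens, tok), 1e-12)))
        tokens.append(tok)
        if tok == EOS:
            return tokens[len(prompt):], None
        if len(log_probs) >= window:
            # Assumed switch rule: geometric mean of base-model probabilities
            # over the most recent `window` drafted tokens.
            confidence = math.exp(sum(log_probs[-window:]) / window)
            if confidence >= threshold:
                switch_at = len(log_probs)
                break

    # Phase 2: the large base model continues from the drafted prefix.
    while len(tokens) - len(prompt) < max_tokens:
        tok = base_step(tokens)
        tokens.append(tok)
        if tok == EOS:
            break
    return tokens[len(prompt):], switch_at


# Toy stand-ins for the two models (illustrative only).
DRAFT = ["I", "can", "help", "with", "that", "."]
toy_draft = lambda ctx: DRAFT[len(ctx) - 1]
toy_base_prob = lambda ctx, tok: 0.1 if len(ctx) < 3 else 0.9
toy_base = lambda ctx: EOS if len(ctx) >= 7 else "base"

out, switch_at = weak_to_strong_decode(
    ["Q"], toy_draft, toy_base_prob, toy_base, window=2, threshold=0.8, max_tokens=10
)
```

In this toy run the base model assigns low probability to the first drafted tokens and high probability afterwards, so the handover happens only once the base model appears able to continue the aligned opening on its own. A real system would replace the callables with actual model calls and would likely score the draft in batches rather than token by token.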
