DeAL: Decoding-time Alignment for Large Language Models

Abstract

Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. Current work focuses on alignment at model training time, through techniques such as Reinforcement Learning with Human Feedback (RLHF). However, it is unclear whether such methods are an effective choice for teaching alignment objectives to the model. First, the inability to incorporate multiple, custom rewards and the reliance on a model developer's view of universal and static principles are key limitations. Second, the residual gaps in model training and the reliability of such approaches are also questionable (e.g., susceptibility to jail-breaking even after safety training). To address these, we propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view decoding as a heuristic-guided search process and facilitate the use of a wide variety of alignment objectives. Our experiments with programmatic constraints such as keyword and length constraints (studied widely in the pre-LLM era) and abstract objectives such as harmlessness and helpfulness (proposed in the post-LLM era) show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs. Lastly, while DeAL can be effectively paired with RLHF and prompting techniques, its generality makes decoding slower; we leave optimizing decoding speed for future work.
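
To make the idea of decoding as a heuristic-guided search concrete, the following is a minimal sketch, not the authors' implementation. It assumes a beam search in which each candidate is ranked by its log-probability plus a weighted, user-supplied alignment score. Every name here (toy_lm, keyword_reward, guided_decode, reward_weight) is hypothetical, and the toy model is a fixed distribution over five tokens standing in for a real LLM.

```python
import math
from typing import Callable, Dict, List, Tuple

Seq = Tuple[str, ...]


def toy_lm(prefix: Seq) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM: returns next-token log-probabilities."""
    # The toy model prefers "dog" over "cat"; after "sleeps" it prefers to stop.
    weights = {"the": 1.0, "cat": 1.0, "dog": 3.0, "sleeps": 2.0, "<eos>": 1.0}
    if prefix and prefix[-1] == "sleeps":
        weights["<eos>"] = 8.0
    total = sum(weights.values())
    return {tok: math.log(w / total) for tok, w in weights.items()}


def keyword_reward(seq: Seq, keywords: Tuple[str, ...] = ("cat",), max_len: int = 4) -> float:
    """Illustrative alignment heuristic: keyword coverage minus a length penalty."""
    covered = sum(1 for k in keywords if k in seq) / len(keywords)
    penalty = 1.0 if len(seq) > max_len else 0.0  # "<eos>" counts toward length
    return covered - penalty


def guided_decode(
    lm: Callable[[Seq], Dict[str, float]],
    reward: Callable[[Seq], float],
    beam_width: int = 3,
    max_steps: int = 5,
    reward_weight: float = 5.0,
) -> Seq:
    """Beam search that ranks candidates by log-prob + reward_weight * reward."""

    def score(item: Tuple[Seq, float]) -> float:
        seq, logp = item
        return logp + reward_weight * reward(seq)

    beams: List[Tuple[Seq, float]] = [((), 0.0)]
    finished: List[Tuple[Seq, float]] = []
    for _ in range(max_steps):
        # Expand every live beam by every possible next token.
        candidates = [
            (seq + (tok,), logp + lp)
            for seq, logp in beams
            for tok, lp in lm(seq).items()
        ]
        # The alignment heuristic steers which candidates survive pruning.
        candidates.sort(key=score, reverse=True)
        beams = []
        for seq, logp in candidates[:beam_width]:
            (finished if seq[-1] == "<eos>" else beams).append((seq, logp))
        if not beams:
            break
    return max(finished or beams, key=score)[0]


aligned = guided_decode(toy_lm, keyword_reward)                       # reward-guided
baseline = guided_decode(toy_lm, keyword_reward, reward_weight=0.0)   # plain beam search
print("guided:  ", aligned)
print("baseline:", baseline)
```

Setting reward_weight to 0 reduces the sketch to ordinary beam search, so comparing the two runs shows how the heuristic pulls the output toward the keyword constraint without any retraining. Swapping keyword_reward for a different scoring function, for example a learned harmlessness scorer, would change the objective without touching the search loop, which is the kind of customization the abstract describes.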
