Fast Best-of-N Decoding via Speculative Rejection

Abstract

The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses accord with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. Although they are the predominant approach, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally infeasible. In this work, we introduce Speculative Rejection, a computationally viable inference-time alignment algorithm. Like Best-of-N, it generates high-scoring responses according to a given reward model, while being between 16 and 32 times more computationally efficient.
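
For readers unfamiliar with the baseline, the following is a minimal Python sketch of Best-of-N as it is commonly described: sample N complete responses, score each with a reward model, and keep the highest-scoring one. It is illustrative only and is not the implementation evaluated in this work. The names `best_of_n`, `make_toy_generator` and `toy_reward` are hypothetical, and the toy generator and reward function merely stand in for an LLM and a learned reward model.

```python
import random
from typing import Callable, List, Tuple

# Assumed interfaces (not from the paper): a generator maps a prompt to one
# sampled response; a reward model maps (prompt, response) to a scalar score.
Generator = Callable[[str], str]
RewardModel = Callable[[str, str], float]


def best_of_n(
    prompt: str, generate: Generator, reward: RewardModel, n: int
) -> Tuple[str, List[str]]:
    """Sample n full responses and return the highest-scoring one.

    Also returns all candidates so the cost (n complete generations for a
    single answer) is visible to the caller.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], candidates


def make_toy_generator(seed: int) -> Generator:
    """Hypothetical stand-in for an LLM: emits five random words."""
    rng = random.Random(seed)
    words = ["safe", "helpful", "rude", "vague"]

    def generate(prompt: str) -> str:
        return " ".join(rng.choice(words) for _ in range(5))

    return generate


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: favors 'safe' and 'helpful'."""
    tokens = response.split()
    return float(tokens.count("safe") + tokens.count("helpful") - tokens.count("rude"))


if __name__ == "__main__":
    question = "How do I reset my password?"
    best, candidates = best_of_n(question, make_toy_generator(0), toy_reward, n=8)
    print(len(candidates), "full generations were needed for one answer")
    print("selected:", best, "| reward:", toy_reward(question, best))
```

In this sketch every candidate is generated to completion before any is discarded, so the generation cost grows linearly with N. That cost is the inefficiency the abstract attributes to Best-of-N.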
