GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods fine-tune LLMs on human preference datasets, but they incur significant training costs and must be retrained to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs, which are designed to evaluate complete responses. This makes them unsuitable for autoregressive text generation, which requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high cost of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining.
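To make the idea of next-token reward guidance concrete, the following is a minimal sketch of a single decoding step. It is not the authors' implementation: the abstract does not state the decoding rule, so the sketch assumes that each objective's next-token reward is added, scaled by a user-chosen weight, to the frozen model's log-probabilities before renormalizing. The function names, the toy three-token vocabulary, and all numbers are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a small vocabulary."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def guided_next_token_logprobs(
    base_logits: Sequence[float],
    rewards_per_objective: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Combine frozen-LLM logits with weighted next-token rewards.

    Assumed rule (an illustration, not taken from the abstract):
        log p(v) = log p_base(v) + sum_i w_i * r_i(v) - log Z
    where r_i(v) is the next-token reward for objective i.
    """
    scores = log_softmax(base_logits)
    for weight, rewards in zip(weights, rewards_per_objective):
        for token_id, reward in enumerate(rewards):
            scores[token_id] += weight * reward
    # Renormalize so the result is a valid log-probability distribution.
    return log_softmax(scores)


# Toy example: one frozen model, two hypothetical preference dimensions.
base_logits = [2.0, 1.0, 0.0]
helpfulness = [-2.0, -0.1, -3.0]   # favors token 1
harmlessness = [-0.2, -2.5, -1.0]  # favors token 0

# Changing the weights shifts the trade-off without retraining anything.
mostly_helpful = guided_next_token_logprobs(
    base_logits, [helpfulness, harmlessness], weights=[1.0, 0.1]
)
mostly_harmless = guided_next_token_logprobs(
    base_logits, [helpfulness, harmlessness], weights=[0.1, 1.0]
)
```

Because the rewards are defined per next token, each decoding step in this sketch needs only one evaluation of each reward model on the current prefix, and the weights can be changed between requests to trade one preference dimension against another.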
