Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Abstract

The widespread applicability and increasing omnipresence of LLMs have created a need to align LLM responses with user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is difficult in such a setting. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of reward, often because their formulation conflates these two aspects, and therefore fail to give well-aligned responses. To remedy this, we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is represented as the periodic replacement of poorly rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
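
The decoupling described in the abstract can be illustrated with a toy loop. The sketch below is not the authors' implementation: `mutate_instruction`, `decode_chunk`, `reward`, `HINTS`, `VOCAB`, and all parameter values are hypothetical stand-ins for an LLM-based instruction mutator, an LLM decoder, and a learned reward model. It only shows the control flow: candidates decode from mutated instructions (exploration), and at a fixed interval the lower-reward half of the population is overwritten with the higher-reward half (exploitation).

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins: each instruction hint biases the toy decoder toward one word.
HINTS = {
    "Be concise.": "precise",
    "Give an example.": "example",
    "Explain step by step.": "step",
    "Think aloud.": "um",
}
VOCAB = ["clear", "helpful", "vague", "example", "step", "um", "precise", "filler"]
GOOD_WORDS = {"clear", "helpful", "example", "step", "precise"}


@dataclass
class Candidate:
    instruction: str
    text: str = ""


def mutate_instruction(instruction: str, rng: random.Random) -> str:
    """Exploration: perturb the instruction (here, by appending a random hint)."""
    return f"{instruction} {rng.choice(list(HINTS))}"


def decode_chunk(cand: Candidate, rng: random.Random, n_tokens: int = 3) -> str:
    """Toy decoder: extend the partial response, biased by hints in the instruction."""
    favored = [word for hint, word in HINTS.items() if hint in cand.instruction]
    words = [
        rng.choice(favored) if favored and rng.random() < 0.5 else rng.choice(VOCAB)
        for _ in range(n_tokens)
    ]
    return (cand.text + " " + " ".join(words)).strip()


def reward(text: str) -> float:
    """Toy reward model: fraction of words that belong to GOOD_WORDS."""
    words = text.split()
    return sum(w in GOOD_WORDS for w in words) / len(words) if words else 0.0


def evolve(instruction, population=4, steps=6, replace_every=2, seed=0):
    rng = random.Random(seed)
    # Each candidate starts from its own mutated instruction.
    cands = [Candidate(mutate_instruction(instruction, rng)) for _ in range(population)]
    for step in range(1, steps + 1):
        for c in cands:
            c.text = decode_chunk(c, rng)
        if step % replace_every == 0:
            # Exploitation: the lower-reward half inherits text from the higher-reward
            # half, paired with a freshly mutated instruction for further exploration.
            cands.sort(key=lambda c: reward(c.text), reverse=True)
            half = population // 2
            for i in range(half):
                cands[half + i] = Candidate(
                    mutate_instruction(instruction, rng), cands[i].text
                )
    return max(cands, key=lambda c: reward(c.text))


best = evolve("Explain recursion.")
print(best.instruction)
print(best.text, reward(best.text))
```

Keeping the replacement step separate from the decoding step means the replacement interval and the mutation strategy can be tuned independently, which is the point of treating exploration and exploitation as distinct mechanisms.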
