Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Abstract

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine the model's responses. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics show that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass its aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and the search depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
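The loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes one plausible reading of the abstract: sample several candidates (search width), score them with a reward function, turn the contrast between the best and worst candidate into a textual critique, and use that critique to produce revised candidates for a fixed number of rounds (search depth). The names `tpo_sketch`, `sample`, `score`, `critique`, and `revise` are hypothetical, and the toy stand-ins below replace real LLM and reward-model calls.

```python
from typing import Callable, List, Tuple


def tpo_sketch(
    query: str,
    sample: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    critique: Callable[[str, str, str], str],
    revise: Callable[[str, str, str, int], List[str]],
    width: int = 3,
    depth: int = 2,
) -> Tuple[str, float]:
    """Illustrative test-time refinement loop; no model parameters are updated."""
    # Score the initial candidates (search width = number of samples).
    pool = [(r, score(query, r)) for r in sample(query, width)]
    for _ in range(depth):  # search depth = number of refinement rounds
        best = max(pool, key=lambda p: p[1])
        worst = min(pool, key=lambda p: p[1])
        # Assumption: the numerical reward is turned into text by contrasting
        # the highest- and lowest-scoring responses seen so far.
        feedback = critique(query, best[0], worst[0])
        # The textual feedback drives the next batch of candidates.
        revised = revise(query, best[0], feedback, width)
        pool.extend((r, score(query, r)) for r in revised)
    # Return the best-scoring response across all rounds.
    return max(pool, key=lambda p: p[1])


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_sample(query: str, n: int) -> List[str]:
    return [f"draft {i}" for i in range(n)]


def toy_score(query: str, response: str) -> float:
    # Pretend reward: each "+" marks one applied improvement.
    return float(response.count("+"))


def toy_critique(query: str, chosen: str, rejected: str) -> str:
    return "The chosen response is better; add more detail."


def toy_revise(query: str, chosen: str, feedback: str, n: int) -> List[str]:
    return [chosen + "+" * (i + 1) for i in range(n)]


if __name__ == "__main__":
    response, reward = tpo_sketch(
        "example query", toy_sample, toy_score, toy_critique, toy_revise
    )
    print(response, reward)
```

In a real setting, `sample`, `critique`, and `revise` would each be prompts to the same frozen LLM and `score` would call a reward model; increasing `width` or `depth` trades extra inference compute for more refinement.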
