Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
January 22, 2025
Authors: Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
cs.AI
Abstract
Large language models (LLMs) demonstrate impressive performance but lack the
flexibility to adapt to human preferences quickly without retraining. In this
work, we introduce Test-time Preference Optimization (TPO), a framework that
aligns LLM outputs with human preferences during inference, removing the need
to update model parameters. Rather than relying on purely numerical rewards,
TPO translates reward signals into textual critiques and uses them as textual
rewards to iteratively refine its response. Evaluations on benchmarks covering
instruction following, preference alignment, safety, and mathematics reveal
that TPO progressively improves alignment with human preferences. Notably,
after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can
surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO
scales efficiently with both the search width and depth during inference.
Through case studies, we illustrate how TPO exploits the innate capacity of LLMs
to interpret and act upon reward signals. Our findings establish TPO as a
practical, lightweight alternative for test-time preference optimization,
achieving alignment on the fly. Our code is publicly available at
https://github.com/yafuly/TPO.
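
The abstract describes TPO as an iterative test-time loop: sample candidate responses, score them with a reward model, convert the numerical preference into a textual critique, and condition the next round of generation on that critique. The sketch below illustrates that loop under stated assumptions; `generate`, `score`, and `critique` are hypothetical placeholders for an LLM sampling call, a reward-model scoring call, and an LLM call that writes the critique, not the authors' API. For the actual implementation, see https://github.com/yafuly/TPO.

```python
"""Minimal sketch of a TPO-style test-time refinement loop (assumptions noted below)."""
from typing import Callable, List


def tpo(
    prompt: str,
    generate: Callable[[str, int], List[str]],  # hypothetical: sample `width` candidate responses
    score: Callable[[str, str], float],         # hypothetical: reward-model score for (prompt, response)
    critique: Callable[[str, str, str], str],   # hypothetical: textual critique contrasting best vs. worst
    width: int = 4,                             # search width: candidates sampled per step
    depth: int = 3,                             # search depth: number of refinement steps
) -> str:
    """Iteratively refine responses at inference time using textual feedback."""
    feedback = ""  # no critique before the first step
    best_response = ""
    for _ in range(depth):
        # Condition generation on the prompt plus the latest textual reward, if any.
        query = prompt if not feedback else f"{prompt}\n\nFeedback on a previous draft:\n{feedback}"
        candidates = generate(query, width)
        # Rank candidates with the numerical reward model.
        ranked = sorted(candidates, key=lambda r: score(prompt, r), reverse=True)
        best_response, worst_response = ranked[0], ranked[-1]
        # Translate the numerical preference into a textual critique that the
        # next generation step can interpret and act upon.
        feedback = critique(prompt, best_response, worst_response)
    return best_response


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without a real model.
    demo_generate = lambda q, k: [f"draft {i} for: {q[:40]}" for i in range(k)]
    demo_score = lambda p, r: float(len(r))
    demo_critique = lambda p, best, worst: f"Prefer responses like '{best}' over '{worst}'."
    print(tpo("Explain test-time preference optimization.", demo_generate, demo_score, demo_critique))
```

No model parameters are updated anywhere in the loop; all adaptation happens through the text passed back into the prompt, which is what lets the method scale with both search width and depth at inference time.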