GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
October 10, 2024
作者: Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh
cs.AI
Abstract
Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs, which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation, which requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining.
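
As a brief sketch of the KL-regularized framework and the next-token reward parametrization described above (the notation below is illustrative and not quoted from the paper): the optimal policy under a trajectory-level reward r, and the token-level sampling rule obtained when the reward is parametrized autoregressively, can be written as

    % KL-regularized objective and its optimal policy (illustrative notation)
    \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
              - \beta \, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{base}}(\cdot \mid x) \right)
    \;\Longrightarrow\;
    \pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{base}}(y \mid x) \, \exp\!\left( r(x, y) / \beta \right)

    % Autoregressive reward parametrization: the reward of a (partial) response is a sum of
    % per-token log-probabilities under a reward model \pi_{r}, so the optimal policy factors
    % token by token and can be sampled directly during decoding.
    r(x, y) = \sum_{t} \log \pi_{r}(y_{t} \mid x, y_{<t})
    \;\Longrightarrow\;
    \pi^{*}(y_{t} \mid x, y_{<t}) \;\propto\; \pi_{\mathrm{base}}(y_{t} \mid x, y_{<t}) \, \pi_{r}(y_{t} \mid x, y_{<t})^{1/\beta}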
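For concreteness, the following is a minimal decoding sketch of how an autoregressive reward model can steer a frozen base LLM at test time. It assumes a Hugging Face-style causal LM interface and that the base model and reward model share a tokenizer; the checkpoint names, the guided_generate helper, and the beta coefficient are illustrative assumptions, not the authors' released implementation.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative checkpoints (assumptions): a frozen base LLM and an autoregressive
    # reward model trained to score next tokens; both must share the same tokenizer.
    BASE_NAME = "base-llm-checkpoint"
    RM_NAME = "autoregressive-rm-checkpoint"

    tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)
    base_lm = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
    reward_lm = AutoModelForCausalLM.from_pretrained(RM_NAME).eval()

    @torch.no_grad()
    def guided_generate(prompt: str, max_new_tokens: int = 128, beta: float = 1.0) -> str:
        """Sample each token from softmax(log pi_base + (1 / beta) * log pi_r)."""
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            # Next-token log-probabilities from the frozen base LLM and the reward model.
            base_logp = torch.log_softmax(base_lm(ids).logits[:, -1, :], dim=-1)
            reward_logp = torch.log_softmax(reward_lm(ids).logits[:, -1, :], dim=-1)
            # KL-regularized combination: smaller beta means stronger reward guidance.
            combined = base_logp + reward_logp / beta
            next_id = torch.multinomial(torch.softmax(combined, dim=-1), num_samples=1)
            ids = torch.cat([ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
        return tokenizer.decode(ids[0], skip_special_tokens=True)

    # Multi-objective variant (sketch): replace reward_logp with a user-weighted sum over
    # several reward models, e.g. w1 * helpful_logp + w2 * harmless_logp, which trades off
    # preference dimensions at decoding time without retraining. Key-value caching and
    # batched decoding are omitted here for brevity.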