GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
October 10, 2024
作者: Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh
cs.AI
Abstract
Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs, which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation, which requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining.
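
As a brief sketch of the KL-regularized framework and the next-token reward parametrization described above (the notation below is illustrative and not quoted from the paper): the optimal policy under a trajectory-level reward r, and the token-level sampling rule obtained when the reward is parametrized autoregressively, can be written as

    % KL-regularized objective and its optimal policy (illustrative notation)
    \pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
              - \beta \, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{base}}(\cdot \mid x) \right)
    \;\Longrightarrow\;
    \pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{base}}(y \mid x) \, \exp\!\left( r(x, y) / \beta \right)

    % Autoregressive reward parametrization: the reward of a (partial) response is a sum of
    % per-token log-probabilities under a reward model \pi_{r}, so the optimal policy factors
    % token by token and can be sampled directly during decoding.
    r(x, y) = \sum_{t} \log \pi_{r}(y_{t} \mid x, y_{<t})
    \;\Longrightarrow\;
    \pi^{*}(y_{t} \mid x, y_{<t}) \;\propto\; \pi_{\mathrm{base}}(y_{t} \mid x, y_{<t}) \, \pi_{r}(y_{t} \mid x, y_{<t})^{1/\beta}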
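For concreteness, the following is a minimal decoding sketch of how an autoregressive reward model can steer a frozen base LLM at test time. It assumes a Hugging Face-style causal LM interface and that the base model and reward model share a tokenizer; the checkpoint names, the guided_generate helper, and the beta coefficient are illustrative assumptions, not the authors' released implementation.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Illustrative checkpoints (assumptions): a frozen base LLM and an autoregressive
    # reward model trained to score next tokens; both must share the same tokenizer.
    BASE_NAME = "base-llm-checkpoint"
    RM_NAME = "autoregressive-rm-checkpoint"

    tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)
    base_lm = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
    reward_lm = AutoModelForCausalLM.from_pretrained(RM_NAME).eval()

    @torch.no_grad()
    def guided_generate(prompt: str, max_new_tokens: int = 128, beta: float = 1.0) -> str:
        """Sample each token from softmax(log pi_base + (1 / beta) * log pi_r)."""
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            # Next-token log-probabilities from the frozen base LLM and the reward model.
            base_logp = torch.log_softmax(base_lm(ids).logits[:, -1, :], dim=-1)
            reward_logp = torch.log_softmax(reward_lm(ids).logits[:, -1, :], dim=-1)
            # KL-regularized combination: smaller beta means stronger reward guidance.
            combined = base_logp + reward_logp / beta
            next_id = torch.multinomial(torch.softmax(combined, dim=-1), num_samples=1)
            ids = torch.cat([ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break
        return tokenizer.decode(ids[0], skip_special_tokens=True)

    # Multi-objective variant (sketch): replace reward_logp with a user-weighted sum over
    # several reward models, e.g. w1 * helpful_logp + w2 * harmless_logp, which trades off
    # preference dimensions at decoding time without retraining. Key-value caching and
    # batched decoding are omitted here for brevity.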