GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment

October 10, 2024
Authors: Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh
cs.AI

Abstract

Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods finetune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model--a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining.
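
The decoding rule implied by the abstract can be illustrated concretely: under the KL-regularized RL framework, the next-token distribution of the guided model is proportional to the frozen base LLM's next-token probability times the exponentiated next-token reward from the autoregressive RM. Below is a minimal sketch of this idea, not the authors' implementation: the model names are placeholders, both models are assumed to share a tokenizer/vocabulary, and taking the RM's next-token log-probabilities as the per-token reward is an assumed parametrization.

```python
# Hypothetical sketch of test-time, reward-guided next-token sampling with an
# autoregressive (next-token) reward model. Names and the exact combination
# rule are assumptions based on the abstract, not the released GenARM code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME = "meta-llama/Llama-2-7b-hf"   # frozen base LLM (placeholder)
ARM_NAME = "path/to/autoregressive-rm"   # autoregressive reward model (placeholder)
BETA = 1.0                               # KL-regularization strength

tok = AutoTokenizer.from_pretrained(BASE_NAME)   # assumes shared vocabulary
base = AutoModelForCausalLM.from_pretrained(BASE_NAME).eval()
arm = AutoModelForCausalLM.from_pretrained(ARM_NAME).eval()

@torch.no_grad()
def generate_guided(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Next-token log-probs from the frozen base LLM (partial response so far).
        base_logp = torch.log_softmax(base(ids).logits[:, -1, :], dim=-1)
        # Next-token rewards from the autoregressive RM, taken here as its
        # next-token log-probs (an assumed parametrization).
        reward = torch.log_softmax(arm(ids).logits[:, -1, :], dim=-1)
        # KL-regularized combination: pi(x_t) ∝ pi_base(x_t) * exp(reward / beta).
        next_logp = torch.log_softmax(base_logp + reward / BETA, dim=-1)
        next_id = torch.multinomial(next_logp.exp(), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```

Because only next-token scores are combined, the frozen LLM is never retrained, and the guidance strength can be varied at inference time through BETA, which is what makes weak-to-strong guidance and per-user trade-offs cheap in this setting.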
