One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment
January 26, 2026
Authors: Hongru Cai, Yongqi Li, Tiezheng Yu, Fengbin Zhu, Wenjie Wang, Fuli Feng, Wenjie Li
cs.AI
Abstract
Alignment of Large Language Models (LLMs) aims to align model outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta-optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
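To make the setup concrete, the sketch below illustrates the idea as stated in the abstract; it is not the authors' implementation. It assumes each response is summarized by the scores of K base reward functions, models a user's reward as a linear combination of those scores fit with a Bradley-Terry loss on preference pairs, learns the weight initialization with a first-order MAML-style outer loop, and up-weights users with larger query loss as a stand-in for the Robust Personalization Objective. All names, hyperparameters, and the synthetic data generator are illustrative assumptions.

```python
# Illustrative sketch only -- not the paper's released code.
# Assumptions: K base reward scores per response, linear per-user weights,
# Bradley-Terry preference loss, first-order MAML outer loop, softmax
# up-weighting of hard users (stand-in for the RPO described above).
import numpy as np

rng = np.random.default_rng(0)
K = 8            # number of base reward functions (assumed)
INNER_LR = 0.5   # per-user adaptation step size (assumed)
OUTER_LR = 0.1   # meta-initialization step size (assumed)
INNER_STEPS = 3  # few-shot adaptation steps
TAU = 1.0        # temperature for hard-user weighting

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_loss_and_grad(w, diffs):
    """Bradley-Terry loss on preference pairs.
    diffs[i] = base_scores(chosen_i) - base_scores(rejected_i), shape (n, K)."""
    margins = diffs @ w
    loss = np.mean(np.logaddexp(0.0, -margins))
    grad = -(sigmoid(-margins)[:, None] * diffs).mean(axis=0)
    return loss, grad

def adapt(w0, support_diffs):
    """Inner loop: a few gradient steps from the meta-initialization."""
    w = w0.copy()
    for _ in range(INNER_STEPS):
        _, g = bt_loss_and_grad(w, support_diffs)
        w -= INNER_LR * g
    return w

def meta_train(users, epochs=200):
    """First-order MAML-style outer loop over users.
    `users` is a list of (support_diffs, query_diffs) tuples."""
    w0 = np.zeros(K)
    for _ in range(epochs):
        query_losses, query_grads = [], []
        for support, query in users:
            w_u = adapt(w0, support)                 # fast per-user adaptation
            loss, grad = bt_loss_and_grad(w_u, query)
            query_losses.append(loss)
            query_grads.append(grad)                 # first-order approximation
        # Hard-user weighting: larger query loss -> larger weight in the meta-update.
        weights = np.exp(np.array(query_losses) / TAU)
        weights /= weights.sum()
        meta_grad = sum(a * g for a, g in zip(weights, query_grads))
        w0 -= OUTER_LR * meta_grad
    return w0

# Toy usage with synthetic users whose true preference weights differ.
def make_user(n_support=4, n_query=16):
    true_w = rng.normal(size=K)
    def pairs(n):
        d = rng.normal(size=(n, K))
        # orient each pair so the "chosen" side matches the user's true reward
        return d * np.sign(d @ true_w)[:, None]
    return pairs(n_support), pairs(n_query)

train_users = [make_user() for _ in range(32)]
w_init = meta_train(train_users)

# Few-shot adaptation to an unseen user from the learned initialization.
new_support, new_query = make_user()
w_new = adapt(w_init, new_support)
print("query loss after few-shot adaptation:", bt_loss_and_grad(w_new, new_query)[0])
```

The sketch uses the first-order approximation of MAML (meta-gradients taken at the adapted weights) purely to keep the example short; the paper's exact meta-optimization and robustness objective may differ.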