Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
October 21, 2025
Authors: Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou
cs.AI
Abstract
Faithfully personalizing large language models (LLMs) to align with
individual user preferences is a critical but challenging task. While
supervised fine-tuning (SFT) quickly reaches a performance plateau, standard
reinforcement learning from human feedback (RLHF) also struggles with the
nuances of personalization. Scalar-based reward models are prone to reward
hacking, which leads to verbose and superficially personalized responses. To
address these limitations, we propose Critique-Post-Edit, a robust
reinforcement learning framework that enables more faithful and controllable
personalization. Our framework integrates two key components: (1) a
Personalized Generative Reward Model (GRM) that provides multi-dimensional
scores and textual critiques to resist reward hacking, and (2) a
Critique-Post-Edit mechanism where the policy model revises its own outputs
based on these critiques for more targeted and efficient learning. Under a
rigorous length-controlled evaluation, our method substantially outperforms
standard PPO on personalization benchmarks. The personalized Qwen2.5-7B model
achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B
model surpasses
the performance of GPT-4.1. These results demonstrate a practical path to
faithful, efficient, and controllable personalization.
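
The abstract describes the framework only at a high level. As a rough illustration of how the two components might interact in a single training step, the sketch below pairs a policy model with a Personalized GRM. It is an assumption-laden reading of the abstract, not the authors' implementation: the callables (policy_generate, grm_evaluate, ppo_update), the prompt templates, and the score aggregation are hypothetical placeholders.

```python
# Minimal sketch of one Critique-Post-Edit training step (a reading of the
# abstract, not the authors' released code). policy_generate, grm_evaluate and
# ppo_update stand in for the policy model, the Personalized Generative Reward
# Model (GRM), and the RL update; all are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GRMFeedback:
    scores: Dict[str, float]  # multi-dimensional scores, e.g. {"faithfulness": 0.8, ...}
    critique: str             # textual critique that guides the post-edit


def mean_score(feedback: GRMFeedback) -> float:
    # Collapse the multi-dimensional scores into a single scalar reward.
    return sum(feedback.scores.values()) / max(len(feedback.scores), 1)


def critique_post_edit_step(
    user_profile: str,
    prompt: str,
    policy_generate: Callable[[str], str],                 # policy: prompt -> response
    grm_evaluate: Callable[[str, str, str], GRMFeedback],  # GRM: (profile, prompt, response) -> feedback
    ppo_update: Callable[[List[str], List[float]], None],  # RL update over (responses, rewards)
) -> str:
    # 1) The policy drafts a personalized response for this user.
    draft = policy_generate(f"{user_profile}\n\n{prompt}")

    # 2) The personalized GRM returns multi-dimensional scores plus a textual critique.
    draft_feedback = grm_evaluate(user_profile, prompt, draft)

    # 3) The policy post-edits its own draft, conditioned on the GRM's critique.
    revision_prompt = (
        f"{user_profile}\n\n{prompt}\n\n"
        f"Previous answer:\n{draft}\n\n"
        f"Critique:\n{draft_feedback.critique}\n\n"
        "Revise the answer to address the critique."
    )
    revised = policy_generate(revision_prompt)
    revised_feedback = grm_evaluate(user_profile, prompt, revised)

    # 4) Both the draft and the critique-guided revision feed the PPO-style update,
    #    so the critique becomes a more targeted learning signal.
    ppo_update([draft, revised], [mean_score(draft_feedback), mean_score(revised_feedback)])
    return revised
```

The point the sketch tries to capture is that the GRM's textual critique is fed back into generation rather than being reduced to a single scalar, which is what the abstract credits for resisting reward hacking and enabling more targeted learning.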