

Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

October 21, 2025
Authors: Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou
cs.AI

Abstract

Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. The personalized Qwen2.5-7B model achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
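
For readers who want a concrete picture of the training loop the abstract describes, the sketch below shows one possible way to organize a single Critique-Post-Edit step. It is a minimal illustration only: every name here (`policy_generate`, `grm_evaluate`, `policy_post_edit`, `ppo_update`, the mean-score reward aggregation) is a hypothetical stub, not the authors' implementation or released code.

```python
# Hypothetical sketch of one Critique-Post-Edit training step.
# All components are illustrative stubs, not the paper's code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GRMOutput:
    scores: List[float]  # multi-dimensional personalization scores
    critique: str        # textual critique used for post-editing


def policy_generate(profile: str, prompt: str) -> str:
    # Stub: the policy model drafts a personalized response.
    return f"[draft response to '{prompt}' for profile '{profile}']"


def grm_evaluate(profile: str, prompt: str, response: str) -> GRMOutput:
    # Stub: the generative reward model returns per-dimension scores
    # plus a textual critique instead of a single scalar reward.
    return GRMOutput(
        scores=[0.6, 0.8, 0.5],
        critique="Shorter answer; reference the user's stated preferences directly.",
    )


def policy_post_edit(profile: str, prompt: str, draft: str, critique: str) -> str:
    # Stub: the policy revises its own draft conditioned on the critique.
    return draft + " [revised per critique]"


def ppo_update(samples: List[Tuple[str, str, str, float]]) -> None:
    # Stub: a standard PPO-style policy update over the collected samples.
    print(f"updating policy on {len(samples)} samples")


def training_step(profile: str, prompt: str) -> None:
    draft = policy_generate(profile, prompt)
    feedback = grm_evaluate(profile, prompt, draft)
    revised = policy_post_edit(profile, prompt, draft, feedback.critique)
    # Aggregate the multi-dimensional scores (simple mean here, as an assumption)
    # into the reward used for the RL update.
    reward = sum(feedback.scores) / len(feedback.scores)
    ppo_update([(prompt, draft, revised, reward)])


training_step("user_42", "Recommend a weekend plan.")
```

How the revised output is folded back into the policy update (e.g., as an additional training signal versus a replacement rollout) is not specified in the abstract; the sketch simply carries it alongside the draft to show that both are available to the learner.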