

Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

October 21, 2025
Authors: Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou
cs.AI

Abstract

Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism in which the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. The personalized Qwen2.5-7B model achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
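
For intuition, the sketch below illustrates one plausible shape of the critique-post-edit rollout described in the abstract: the policy drafts a personalized response, the personalized GRM returns multi-dimensional scores plus a textual critique, and the policy revises its draft conditioned on that critique before a PPO-style update (not shown). All model calls, scoring dimensions, and the reward aggregation here are placeholder assumptions, not the paper's actual implementation; only the control flow follows the framework as described.

```python
# Minimal, hypothetical sketch of a Critique-Post-Edit rollout.
# Every function body is a stand-in; only the draft -> critique -> post-edit
# control flow mirrors the framework described in the abstract.

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GRMFeedback:
    scores: Dict[str, float]  # multi-dimensional scores (dimensions assumed here)
    critique: str             # textual critique used for post-editing


def policy_generate(prompt: str, profile: str) -> str:
    # Placeholder: the policy LLM drafts a personalized response.
    return f"[draft answer to '{prompt}' for profile '{profile}']"


def grm_evaluate(prompt: str, profile: str, response: str) -> GRMFeedback:
    # Placeholder: the Personalized GRM scores the response along several
    # dimensions and writes a critique (intended to resist reward hacking
    # better than a single scalar score).
    scores = {"faithfulness": 0.7, "preference_fit": 0.6, "conciseness": 0.8}
    critique = "Too generic; reference the user's stated preference explicitly."
    return GRMFeedback(scores=scores, critique=critique)


def policy_post_edit(prompt: str, profile: str, draft: str, critique: str) -> str:
    # Placeholder: the policy revises its own draft conditioned on the critique.
    return draft + f" [revised per critique: {critique}]"


def scalarize(scores: Dict[str, float]) -> float:
    # Assumed aggregation of multi-dimensional scores into a single PPO reward.
    return sum(scores.values()) / len(scores)


def critique_post_edit_rollout(prompt: str, profile: str) -> List[Tuple[str, float]]:
    """One rollout yielding (response, reward) pairs that a PPO trainer
    (not shown) would consume to update the policy."""
    draft = policy_generate(prompt, profile)
    feedback = grm_evaluate(prompt, profile, draft)
    revised = policy_post_edit(prompt, profile, draft, feedback.critique)
    revised_feedback = grm_evaluate(prompt, profile, revised)
    return [
        (draft, scalarize(feedback.scores)),
        (revised, scalarize(revised_feedback.scores)),
    ]


if __name__ == "__main__":
    for response, reward in critique_post_edit_rollout(
        "Recommend a weekend plan", "prefers quiet outdoor activities"
    ):
        print(f"reward={reward:.2f}  response={response}")
```

In this reading, the post-edited response gives the trainer a more targeted learning signal than the raw draft alone, which is one way to interpret the efficiency claim in the abstract.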