

Unified Personalized Reward Model for Vision Generation

February 2, 2026
Authors: Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou, Cheng Jin, Jiaqi Wang
cs.AI

Abstract

Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds it in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap supervised fine-tuning (SFT), equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.
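
As background for the two preference-learning objectives named in the abstract, here is a minimal sketch using the standard formulations from the literature; the paper's own notation and any modifications it makes are not reproduced here. Bradley-Terry-style reward modeling fits a scalar reward r_phi on preference pairs (x, y_w, y_l), whereas DPO optimizes the policy pi_theta directly against a reference policy pi_ref on the same kind of pairs:

\[
\mathcal{L}_{\text{BT}}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]
\]
\[
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
\]

Here sigma is the logistic function and beta is the DPO temperature; y_w and y_l denote the preferred and rejected samples, respectively.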
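
The GRPO integration can likewise be pictured with a short sketch. This is our own illustration under standard GRPO assumptions (sample a group of candidates per prompt, then use group-normalized rewards as advantages), not the authors' released code; names such as reward_model and generate_group are hypothetical placeholders.

    # Minimal sketch (illustrative only) of turning reward-model scores into
    # GRPO-style group-relative advantages for a generator update.
    import torch

    def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Normalize rewards across the group of samples drawn for the same prompt.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Usage: score G candidates produced for one prompt, then weight the policy
    # gradient of the image/video generator by these normalized advantages.
    # candidates = generate_group(prompt, num_samples=8)                    # hypothetical
    # rewards = torch.tensor([reward_model(prompt, c) for c in candidates]) # hypothetical
    # advantages = grpo_advantages(rewards)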