SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
October 14, 2025
Authors: Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu
cs.AI
Abstract
Recently, remarkable progress has been made in Unified Multimodal Models
(UMMs), which integrate vision-language generation and understanding
capabilities within a single framework. However, a significant gap exists where
a model's strong visual understanding often fails to transfer to its visual
generation. A model might correctly understand an image based on user
instructions, yet be unable to generate a faithful image from text prompts.
This phenomenon directly raises a compelling question: Can a model achieve
self-improvement by using its understanding module to reward its generation
module? To bridge this gap and achieve self-improvement, we introduce SRUM, a
self-rewarding post-training framework that can be directly applied to existing
UMMs of various designs. SRUM creates a feedback loop where the model's own
understanding module acts as an internal "evaluator", providing corrective
signals to improve its generation module, without requiring additional
human-labeled data. To ensure this feedback is comprehensive, we designed a
global-local dual reward system. To tackle the inherent structural complexity
of images, this system offers multi-scale guidance: a global reward
ensures the correctness of the overall visual semantics and layout, while a
local reward refines fine-grained, object-level fidelity. SRUM leads
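The dual reward can be pictured as a weighted blend of one global score and averaged per-object scores. The sketch below is a minimal illustration under assumed conventions (scores in [0, 1], equal default weights, a simple mean over objects); the function name, weights, and scoring details are hypothetical, not the paper's exact formulation.

```python
# Hypothetical sketch of a global-local dual reward. The understanding
# module is assumed to produce a single global score (overall semantics
# and layout) plus one fidelity score per grounded object.

def combined_reward(global_score, object_scores, w_global=0.5, w_local=0.5):
    """Blend a global semantic/layout score with object-level fidelity scores.

    global_score:  float in [0, 1] judging the whole generated image.
    object_scores: list of floats in [0, 1], one per detected object,
                   judging fine-grained, object-level fidelity.
    Returns a scalar reward used to post-train the generation module.
    """
    if object_scores:
        local_score = sum(object_scores) / len(object_scores)
    else:
        local_score = 0.0  # no objects grounded: only the global term contributes
    return w_global * global_score + w_local * local_score
```

With equal weights, an image whose layout is correct but whose objects are only partially faithful receives a reward between the two scores, so neither scale dominates the corrective signal.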
to powerful capabilities and shows strong generalization, boosting performance
on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82
to 46.75. Overall, our work establishes a powerful new paradigm for
enabling a UMM's understanding module to guide and enhance its own generation
via self-rewarding.