SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
October 14, 2025
Authors: Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu
cs.AI
Abstract
Recently, remarkable progress has been made in Unified Multimodal Models
(UMMs), which integrate vision-language generation and understanding
capabilities within a single framework. However, a significant gap exists where
a model's strong visual understanding often fails to transfer to its visual
generation. A model might correctly understand an image based on user
instructions, yet be unable to generate a faithful image from text prompts.
This phenomenon directly raises a compelling question: Can a model achieve
self-improvement by using its understanding module to reward its generation
module? To bridge this gap and achieve self-improvement, we introduce SRUM, a
self-rewarding post-training framework that can be directly applied to existing
UMMs of various designs. SRUM creates a feedback loop where the model's own
understanding module acts as an internal "evaluator", providing corrective
signals to improve its generation module, without requiring additional
human-labeled data. To ensure this feedback is comprehensive, we designed a
global-local dual reward system. To tackle the inherent structural complexity
of images, this system offers multi-scale guidance: a global reward
ensures the correctness of the overall visual semantics and layout, while a
local reward refines fine-grained, object-level fidelity. SRUM delivers
substantial gains and strong generalization, boosting performance
on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82
to 46.75. Overall, our work establishes a powerful new paradigm in which a
UMM's understanding module guides and enhances its own generation
via self-rewarding.
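To make the global-local dual reward concrete, the sketch below shows one plausible way such a signal could be combined. Everything here is an illustrative assumption, not the paper's implementation: the scoring functions are toy stand-ins for the UMM's understanding module, and the 0.5/0.5 weighting is invented for the example.

```python
# Hypothetical sketch of a global-local dual reward, in the spirit of SRUM.
# All function names, heuristics, and weights are illustrative assumptions.

def understanding_score(image, prompt):
    """Stand-in for the understanding module scoring global semantics and
    layout of a generated image against the prompt (range 0..1).
    Toy heuristic: fraction of prompt words found among detected concepts."""
    words = prompt.lower().split()
    present = sum(1 for w in words if w in image.get("concepts", []))
    return present / max(len(words), 1)

def object_score(image, obj):
    """Stand-in for object-level fidelity scoring (range 0..1)."""
    return 1.0 if obj in image.get("concepts", []) else 0.0

def dual_reward(image, prompt, objects, w_global=0.5, w_local=0.5):
    """Combine a global reward with the mean of local, per-object rewards.
    The equal weighting is an assumption for illustration only."""
    r_global = understanding_score(image, prompt)
    r_local = (sum(object_score(image, o) for o in objects) / len(objects)
               if objects else 0.0)
    return w_global * r_global + w_local * r_local

# Usage: represent a generated "image" by the concepts detected in it.
image = {"concepts": ["red", "cube", "ball"]}
reward = dual_reward(image, "red cube on a ball", ["cube", "ball"])
```

In a self-rewarding post-training loop, a scalar like `reward` would then serve as the corrective signal for updating the generation module, with no extra human-labeled data.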