Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine their outputs after the initial generation, potentially raising the upper bound on performance. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, in which the UMM produces editing instructions that modify misaligned regions while preserving aligned content. However, editing instructions often describe the prompt-image misalignment only coarsely, which leads to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates the image conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
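
The three reported improvements are easier to compare as absolute and relative gains, since Geneval is quoted on a different numeric scale from the other two benchmarks. The short sketch below only redoes that arithmetic on the numbers quoted in the abstract. The names `REPORTED` and `gains` are illustrative and not from the paper, and the abstract does not specify what configuration the starting scores correspond to.

```python
# Scores quoted in the abstract: (starting score, score with RvR).
# The dictionary and helper names are illustrative assumptions, not the paper's.
REPORTED = {
    "Geneval": (0.78, 0.91),
    "DPGBench": (84.02, 87.21),
    "UniGenBench++": (61.53, 77.41),
}


def gains(before, after):
    """Return (absolute gain, relative gain in percent) for one benchmark."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


if __name__ == "__main__":
    for name, (before, after) in REPORTED.items():
        absolute, relative = gains(before, after)
        print(f"{name}: +{absolute:.2f} absolute, +{relative:.2f}% relative")
```

On these figures, UniGenBench++ shows the largest relative gain and DPGBench the smallest.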
