Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
April 28, 2026
Authors: Jiayi Guo, Linqing Wang, Jiangshan Wang, Yang Yue, Zeyu Liu, Zhiyuan Zhao, Qinglin Lu, Gao Huang, Chunyu Wang
cs.AI
Abstract
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
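The refinement loop described above can be sketched in code. The following is an illustrative toy sketch of the RvR idea only, not the authors' implementation: the `UMM` class and its `generate` / `encode_semantic_tokens` methods are hypothetical stand-ins, and "semantic tokens" are mocked as word lists. The key point it illustrates is that regeneration is conditioned on the target prompt plus the previous image's semantic tokens, with no editing instruction and no pixel-level preservation constraint.

```python
# Hedged sketch of Refinement via Regeneration (RvR).
# All names here (UMM, Image, encode_semantic_tokens) are hypothetical.
from dataclasses import dataclass


@dataclass
class Image:
    tokens: list  # stand-in for image content


class UMM:
    """Toy unified multimodal model interface (hypothetical)."""

    def generate(self, prompt, condition_tokens=None):
        # Regeneration conditions on the target prompt and, if given,
        # the semantic tokens of a previous image -- unlike RvE, there is
        # no editing instruction and no strict content-preservation rule.
        base = prompt.split()
        if condition_tokens is not None:
            base += [t for t in condition_tokens if t not in base]
        return Image(tokens=base)

    def encode_semantic_tokens(self, image):
        # Stand-in for extracting semantic (not pixel-level) tokens.
        return image.tokens


def refine_via_regeneration(model, prompt, rounds=1):
    image = model.generate(prompt)                 # initial T2I generation
    for _ in range(rounds):
        sem = model.encode_semantic_tokens(image)  # semantic tokens of current image
        image = model.generate(prompt, condition_tokens=sem)  # regenerate, don't edit
    return image
```

Because each round re-derives the image from the prompt and semantic tokens alone, the modification space is the full generation space rather than the narrow set of edits that preserve aligned pixels.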