Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
January 8, 2026
Authors: Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai
cs.AI
Abstract
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
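The abstract does not specify how the surrogate reward is implemented. The sketch below is a hypothetical illustration only, assuming a CLIP-style text-image similarity is used to score how well a generated image matches the structured reasoning text; the model checkpoint and the `surrogate_reward` helper are placeholders and are not taken from the paper.

```python
# Minimal sketch of a surrogate alignment reward (assumption: CLIP-style
# text-image similarity; the paper's actual reward model is not specified).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def surrogate_reward(reasoning_text: str, generated_image) -> float:
    """Score alignment between reasoning text and a generated PIL image."""
    inputs = processor(
        text=[reasoning_text],
        images=generated_image,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    with torch.no_grad():
        out = model(**inputs)
        # Scaled cosine similarity between the image and text embeddings.
        reward = out.logits_per_image[0, 0].item()
    return reward
```

In an RL loop of this kind, such a scalar would serve as the per-sample reward for a policy-gradient style update of the generator; the specific RL algorithm used by Re-Align is not stated in the abstract.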