

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

April 8, 2026
Authors: Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang
cs.AI

Abstract

Multimodal generation has long been dominated by text-driven pipelines in which language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow: all inputs are converted into visual prompts, enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks, including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems. These results establish a new foundation for fully vision-centric generative modeling, where perception and creation coexist within a single continuous visual space.
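As background for the "image-in, image-out flow matching" formulation the abstract describes, the sketch below shows a generic conditional flow-matching training step under standard rectified-flow assumptions. It is an illustrative reconstruction, not FlowInOne's released code; the velocity network and its `model(x_t, t)` signature are hypothetical stand-ins for the paper's single flow matching model.

```python
import torch

def flow_matching_loss(model, x_src, x_tgt):
    # Generic conditional flow-matching loss (a sketch, not FlowInOne's
    # exact objective): a hypothetical velocity network `model` is
    # trained to regress the constant velocity (x_tgt - x_src) along the
    # straight path from the rendered visual prompt x_src to the target
    # image x_tgt, both shaped (B, C, H, W).
    b = x_src.shape[0]
    t = torch.rand(b, device=x_src.device)          # t ~ U[0, 1] per sample
    t_img = t.view(b, 1, 1, 1)                      # broadcast over C, H, W
    x_t = (1.0 - t_img) * x_src + t_img * x_tgt     # point on the straight path
    v_target = x_tgt - x_src                        # straight-line velocity
    v_pred = model(x_t, t)                          # assumed signature v_theta(x_t, t)
    return torch.mean((v_pred - v_target) ** 2)
```

At inference time, such a network would be integrated with an ODE solver from the prompt image toward the output image, which is what lets one model cover text-to-image generation, layout-guided editing, and visual instruction following once every input has been rendered as a visual prompt.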