
FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching

April 8, 2026
Authors: Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang
cs.AI

Abstract

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.
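The abstract's central mechanism is a single flow matching model operating in one continuous visual space. As a rough illustration of what the training objective of such a model looks like, the sketch below implements the standard (rectified) flow matching regression target on toy arrays; the variable names (`x0`, `x1`, `v_target`) and the use of plain NumPy are illustrative assumptions, not the paper's actual implementation or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for flattened latent images: x1 are data samples
# (the "image-out" targets), x0 are Gaussian noise samples.
x1 = rng.normal(size=(4, 16))
x0 = rng.normal(size=(4, 16))

# Flow matching interpolates linearly between noise and data:
#   x_t = (1 - t) * x0 + t * x1,  t ~ U(0, 1)
t = rng.uniform(size=(4, 1))
x_t = (1.0 - t) * x0 + t * x1

# The regression target is the constant velocity of that path,
# v = x1 - x0; a network v_theta(x_t, t, visual_prompt) would be
# trained to predict it with a mean-squared-error loss.
v_target = x1 - x0

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """MSE between a predicted and target velocity field."""
    return float(np.mean((pred - target) ** 2))

# Sanity check: a perfect velocity predictor incurs zero loss.
assert mse(v_target, v_target) == 0.0
```

In a vision-centric setup like the one the abstract describes, the conditioning signal (text rendered as a visual prompt, a layout image, an edit instruction image) would enter through the model's input rather than through a separate cross-modal branch; the objective itself is unchanged.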