FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Abstract

Multimodal generation has long been dominated by text-driven pipelines in which language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow: all inputs are converted into visual prompts, enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks, including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems. These results establish a new foundation for fully vision-centric generative modeling, in which perception and creation coexist within a single continuous visual space.
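The abstract gives no equations, so the following is only a minimal sketch of what an image-in, image-out flow matching objective could look like. It assumes the common straight-line (rectified-flow-style) path from a source image (the visual prompt) to a target image, which the abstract does not confirm. All function names are hypothetical, and a closed-form oracle velocity stands in for the learned model purely to check the arithmetic.

```python
import numpy as np

def interpolate(x_src, x_tgt, t):
    """Point on the straight-line path from source to target image at time t in [0, 1]."""
    return (1.0 - t) * x_src + t * x_tgt

def target_velocity(x_src, x_tgt):
    """Velocity of the straight-line path; it is constant in t."""
    return x_tgt - x_src

def flow_matching_loss(velocity_fn, x_src, x_tgt, t):
    """Mean squared error between the predicted and target velocity at x_t."""
    x_t = interpolate(x_src, x_tgt, t)
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target_velocity(x_src, x_tgt)) ** 2))

def euler_sample(velocity_fn, x_src, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x_src.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy check: an oracle velocity field for a single known image pair.
# A real system would replace `oracle` with a trained network (assumption).
rng = np.random.default_rng(0)
x_src = rng.random((8, 8, 3))  # stands in for the rendered visual prompt
x_tgt = rng.random((8, 8, 3))  # stands in for the desired output image
oracle = lambda x, t: x_tgt - x_src

loss = flow_matching_loss(oracle, x_src, x_tgt, t=0.3)
x_out = euler_sample(oracle, x_src, num_steps=10)
```

In this sketch the path starts at an image rather than at Gaussian noise, so no noise schedule appears anywhere in the objective or the sampler. This is one plausible reading of the "image-in, image-out" claim, not a description of the authors' implementation.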
