Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Abstract

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: the generation of plausible visual content grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation. The levels progress from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress because they emphasize perceptual quality while missing structural, temporal, and causal failures. By combining a benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.