Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Abstract

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.
