Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
December 1, 2025
Authors: Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
cs.AI
Abstract
Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in training and evaluation leads to overfitting to static pattern matching and semantic fusion, fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event-progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and comprises 1,000 four-stage prompts spanning six scientific and humanities domains. To move evaluation from single images to sequential frames, and to assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physical plausibility, and aesthetic quality. A comprehensive evaluation of 15 models (10 specialized T2I models and 5 unified multimodal models) reveals that specialized T2I models excel at aesthetic rendering yet lack intrinsic world knowledge. Unified multimodal models bridge this gap, consistently outperforming their specialized counterparts in causal narrative coherence. However, even these unified architectures still lag behind closed-source models and struggle with the core challenge of spatiotemporal consistency. These findings demonstrate that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling, ultimately limiting both the internalization of world knowledge and generation quality.
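To make the composition of such a holistic metric concrete, here is a minimal sketch of how per-frame sub-scores could be aggregated across a four-stage sequence. The equal weighting, the sub-score names, and the per-stage averaging are illustrative assumptions, not the paper's actual Envision-Score formula.

```python
from statistics import fmean

def envision_score(stage_scores, weights=(1/3, 1/3, 1/3)):
    """Hypothetical aggregation of per-stage sub-scores into one number.

    stage_scores: one (consistency, physicality, aesthetics) tuple per
    generated frame, each sub-score normalized to [0, 1].
    weights: relative importance of the three dimensions (assumed equal).
    """
    # Weighted combination of the three dimensions within each stage.
    per_stage = [
        sum(w * s for w, s in zip(weights, scores))
        for scores in stage_scores
    ]
    # Average the weighted stage scores across the whole sequence.
    return fmean(per_stage)

# Example: a four-stage generation with varying sub-scores per frame.
frames = [(0.9, 0.8, 0.7), (0.8, 0.8, 0.8), (0.7, 0.9, 0.8), (0.6, 0.7, 0.8)]
score = envision_score(frames)
```

A sequence-level average like this rewards models that stay coherent across all four stages rather than producing one strong frame, which matches the benchmark's emphasis on causal-temporal consistency over isolated single-image quality.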