

Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights

December 1, 2025
Authors: Juanxi Tian, Siyuan Li, Conghui He, Lijun Wu, Cheng Tan
cs.AI

Abstract

Current multimodal models aim to transcend the limitations of single-modality representations by unifying understanding and generation, often using text-to-image (T2I) tasks to calibrate semantic consistency. However, their reliance on static, single-image generation in both training and evaluation leads to overfitting to static pattern matching and semantic fusion, fundamentally hindering their ability to model dynamic processes that unfold over time. To address these constraints, we propose Envision, a causal event progression benchmark for chained text-to-multi-image generation. Grounded in world knowledge and structured by spatiotemporal causality, it reorganizes existing evaluation dimensions and includes 1,000 four-stage prompts spanning six scientific and humanities domains. To shift evaluation from single images to sequential frames and to assess whether models truly internalize world knowledge while adhering to causal-temporal constraints, we introduce Envision-Score, a holistic metric integrating multi-dimensional consistency, physical plausibility, and aesthetics. A comprehensive evaluation of 15 models (10 specialized T2I models, 5 unified multimodal models) shows that specialized T2I models excel at aesthetic rendering yet lack intrinsic world knowledge, while unified multimodal models bridge this gap, consistently outperforming their specialized counterparts in causal narrative coherence. However, even these unified architectures still lag behind closed-source models and struggle with the core challenge of spatiotemporal consistency. This demonstrates that a focus on causally isolated single images impedes multi-frame reasoning and generation, promoting static pattern matching over dynamic world modeling, and ultimately limiting both the internalization of world knowledge and generation.
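The abstract describes Envision-Score as a weighted combination of consistency, physicality, and aesthetics over a four-stage image sequence, without giving the formula. The sketch below is a minimal illustration of that kind of aggregation; the class name, field names, and weights are hypothetical and not the paper's actual calibration.

```python
from dataclasses import dataclass

@dataclass
class FrameScores:
    """Per-frame component scores in [0, 1] (field names are illustrative)."""
    consistency: float  # agreement with the prompt and neighboring frames
    physicality: float  # physical plausibility of the depicted state
    aesthetics: float   # visual quality

def envision_style_score(frames, weights=(0.4, 0.4, 0.2)):
    """Aggregate per-frame scores into one sequence-level score.

    Averages each dimension across the generated stages, then combines
    the three means with a weighted sum. Weights are assumptions for
    illustration only.
    """
    if not frames:
        raise ValueError("need at least one frame")
    w_c, w_p, w_a = weights
    n = len(frames)
    mean_c = sum(f.consistency for f in frames) / n
    mean_p = sum(f.physicality for f in frames) / n
    mean_a = sum(f.aesthetics for f in frames) / n
    return w_c * mean_c + w_p * mean_p + w_a * mean_a
```

A sequence scoring perfectly on all three dimensions yields 1.0; uneven per-stage scores are first averaged per dimension, so a single implausible frame lowers the physicality mean rather than zeroing the whole score.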
PDF | December 3, 2025