Physics Question Scene Graph: Fine-Grained Evaluation of Physical Plausibility in Text-to-Video Generation

Abstract

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that obey basic physical laws. Compounding the problem, there is no reliable fine-grained evaluation method for localizing and specifying physical-law violations in a video. We address this gap with the Physics Question Scene Graph (PQSG), a hierarchical, question-based evaluation pipeline. PQSG checks a generated video's faithfulness to its prompt along three axes: objects, actions, and adherence to physical laws. It does so with a graph-structured hierarchy of questions generated by a vision-language model (VLM) guided by high-quality in-context examples. Representing the questions as a graph introduces logical dependencies between them, ensuring that each query is contextually valid. PQSG also yields fine-grained assessments that identify which qualities of a video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset of physics-based prompts and corresponding videos generated by diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated by humans across multiple categories. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments and find higher overall correlations than prior work. We also find that PQSG ranks the closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the FinePhyEval annotations also support subtask evaluation: we benchmark two strong VLMs on generating and answering questions and find that, while the models can generate human-like questions, they still fall short of human accuracy in answering them.
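The dependency idea described above can be illustrated with a small sketch. This is not the PQSG implementation: the `Question` class, the `gated_scores` function, the example questions, and the rule that a question only counts as satisfied when all of its ancestors are satisfied are assumptions made for illustration. The abstract does not say how PQSG treats a question whose parent fails, nor how answers are aggregated into scores.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """Hypothetical node in a question graph (not the paper's data structure)."""
    qid: str
    text: str
    category: str  # assumed labels: "object", "action", or "physics"
    parents: list[str] = field(default_factory=list)


def gated_scores(questions, answers):
    """Per-category fraction of satisfied questions.

    Assumption: a question is satisfied only if its own answer is yes AND
    every ancestor question is satisfied. The graph is assumed to be acyclic.
    """
    by_id = {q.qid: q for q in questions}
    cache = {}

    def satisfied(qid):
        if qid not in cache:
            q = by_id[qid]
            cache[qid] = bool(answers[qid]) and all(satisfied(p) for p in q.parents)
        return cache[qid]

    totals, hits = {}, {}
    for q in questions:
        totals[q.category] = totals.get(q.category, 0) + 1
        hits[q.category] = hits.get(q.category, 0) + int(satisfied(q.qid))
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Illustrative questions for a made-up prompt: "A ball is dropped and bounces on the floor."
questions = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Is there a floor?", "object"),
    Question("q3", "Does the ball fall?", "action", parents=["q1"]),
    Question("q4", "Does the ball bounce off the floor?", "action", parents=["q2", "q3"]),
    Question("q5", "Is each bounce lower than the previous one?", "physics", parents=["q4"]),
    Question("q6", "Does the ball speed up while falling?", "physics", parents=["q3"]),
]

# Made-up VLM answers: the bounce is missing (q4), yet q5 was answered "yes".
answers = {"q1": True, "q2": True, "q3": True, "q4": False, "q5": True, "q6": True}

scores = gated_scores(questions, answers)
print(scores)
```

In this toy case the "yes" on q5 is discarded because its parent q4 failed, which is the kind of contextually invalid query a dependency graph is meant to rule out. Skipping such questions instead of counting them as failures would be an equally plausible design.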