Physics Question Scene Graph: Fine-Grained Evaluation of Physical Plausibility in Text-to-Video Generation

Abstract

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable, fine-grained evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates a generated video by checking its faithfulness to the prompt in terms of objects, actions, and adherence to physical laws, using a graph-based hierarchy of questions generated by a vision-language model (VLM) and guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies among questions, ensuring that each query is contextually valid. Moreover, PQSG provides fine-grained assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset of physics-based prompts and corresponding videos generated by diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated by humans across multiple categories. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks the closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.
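To make the idea of logical dependencies among questions concrete, the following is a minimal sketch of dependency-gated scoring over a question graph. It is not the PQSG implementation: the `Question` class, the gating rule (a question counts as passed only if all of its parent questions passed), the per-category averaging, and the example prompt and questions are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Question:
    """One node of a hypothetical question graph."""
    qid: str
    text: str
    category: str  # assumed categories: "object", "action", "physics"
    parents: list[str] = field(default_factory=list)


def resolve_answers(questions, raw_answers):
    """Gate each raw yes/no answer on its parents (assumed rule).

    A question passes only if the raw answer is True AND every parent
    question also passes. The graph is assumed to be acyclic.
    """
    by_id = {q.qid: q for q in questions}
    resolved = {}

    def resolve(qid):
        if qid not in resolved:
            q = by_id[qid]
            parents_ok = all(resolve(p) for p in q.parents)
            resolved[qid] = parents_ok and raw_answers[qid]
        return resolved[qid]

    for q in questions:
        resolve(q.qid)
    return resolved


def category_scores(questions, resolved):
    """Fraction of passed questions per category."""
    totals = {}
    for q in questions:
        passed, count = totals.get(q.category, (0, 0))
        totals[q.category] = (passed + int(resolved[q.qid]), count + 1)
    return {cat: passed / count for cat, (passed, count) in totals.items()}


# Hypothetical prompt: "A ball is dropped and bounces on the floor."
questions = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Is there a floor?", "object"),
    Question("q3", "Does the ball fall?", "action", parents=["q1"]),
    Question("q4", "Does the ball speed up as it falls?", "physics",
             parents=["q3"]),
]

# Raw answers as a VLM might return them (made up for this example).
# q4 is answered "yes" even though its parent q3 failed.
raw_answers = {"q1": True, "q2": True, "q3": False, "q4": True}

resolved = resolve_answers(questions, raw_answers)
scores = category_scores(questions, resolved)
```

In this example the raw "yes" for `q4` is discarded because the ball was never judged to fall, so asking about its acceleration is not contextually valid. The per-category scores indicate which aspect (objects, actions, or physics) failed rather than collapsing everything into a single number.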