Physics Question Scene Graph: Fine-Grained Evaluation of Physical Plausibility in Text-to-Video Generation

Abstract

Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this, there is a lack of reliable fine-grained evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies among questions, ensuring that each query is contextually valid. Moreover, PQSG provides fine-grained assessments of which aspects of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.
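To make the idea of logically dependent questions concrete, the following is a minimal sketch rather than the PQSG implementation. It assumes that a question is only asked when all of its parent questions were answered "yes", that a skipped question counts as failed, and that each category score is the fraction of "yes" answers. The `Question` class, the `evaluate` and `category_scores` functions, the toy prompt, and the canned answers standing in for a VLM are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Question:
    """One node in a hypothetical question graph."""
    qid: str
    text: str
    category: str  # "object", "action", or "physics"
    parents: list[str] = field(default_factory=list)


def evaluate(questions, answer_fn):
    """Answer questions in dependency order.

    Assumed policy: a question is asked only if every parent was answered
    "yes"; otherwise it is marked failed without being asked.
    `questions` must list parents before their children.
    """
    answers = {}
    for q in questions:
        if all(answers[p] for p in q.parents):
            answers[q.qid] = bool(answer_fn(q))
        else:
            answers[q.qid] = False
    return answers


def category_scores(questions, answers):
    """Fraction of "yes" answers per category (an assumed scoring rule)."""
    totals, passed = {}, {}
    for q in questions:
        totals[q.category] = totals.get(q.category, 0) + 1
        passed[q.category] = passed.get(q.category, 0) + int(answers[q.qid])
    return {c: passed[c] / totals[c] for c in totals}


# Toy prompt: "A ball rolls off a table and falls to the floor."
questions = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Is there a table?", "object"),
    Question("q3", "Does the ball roll off the table?", "action", ["q1", "q2"]),
    Question("q4", "Does the ball accelerate downward after leaving the table?",
             "physics", ["q3"]),
]

# Canned answers stand in for a VLM looking at the generated video.
canned = {"q1": True, "q2": True, "q3": True, "q4": False}
answers = evaluate(questions, lambda q: canned[q.qid])
scores = category_scores(questions, answers)
print(scores)  # {'object': 1.0, 'action': 1.0, 'physics': 0.0}
```

In this toy case the objects and the action are present, so the low physics score points specifically at the fall rather than at a missing ball or table. Had the ball been absent, the dependent action and physics questions would never have been asked.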