TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility

October 8, 2025
Authors: Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina
cs.AI

Abstract
Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
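
The abstract does not spell out how the trajectory-aware attention module is wired into the VLM. Purely as an illustrative sketch, and not the TRAVL implementation, the snippet below assumes that "trajectory-aware" means adding a learnable bias to the attention scores between video patch tokens that lie on the same tracked object trajectory; the class name, tensor shapes, and the `traj_ids` input are all assumptions made for the example.

```python
# Hypothetical sketch (not the TRAVL module): self-attention over video patch
# tokens with an additive bias that favors tokens on the same tracked trajectory.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrajectoryBiasedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, traj_bias: float = 2.0):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable scalar controlling how strongly same-trajectory tokens attend.
        self.traj_bias = nn.Parameter(torch.tensor(traj_bias))

    def forward(self, tokens: torch.Tensor, traj_ids: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) video patch tokens.
        # traj_ids: (B, N) integer id of the tracked trajectory each token
        # belongs to, with -1 meaning background (assumed input format).
        B, N, dim = tokens.shape
        qkv = self.qkv(tokens).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                      # (B, H, N, head_dim) each

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, H, N, N)

        # Additive bias for token pairs on the same (non-background) trajectory.
        same_traj = (traj_ids[:, :, None] == traj_ids[:, None, :]) & (traj_ids[:, :, None] >= 0)
        attn = attn + self.traj_bias * same_traj[:, None].float()

        out = F.softmax(attn, dim=-1) @ v                         # (B, H, N, head_dim)
        out = out.transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)


# Usage: 2 clips, 16 patch tokens each, 64-dim features, random trajectory ids.
x = torch.randn(2, 16, 64)
ids = torch.randint(-1, 4, (2, 16))
print(TrajectoryBiasedAttention(64)(x, ids).shape)  # torch.Size([2, 16, 64])
```

Biasing rather than masking keeps ordinary spatial attention intact while making motion along a tracked trajectory easier for the model to encode, which is the general behavior the abstract attributes to the trajectory-aware module.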