TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
October 8, 2025
Authors: Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina
cs.AI
Abstract
Despite impressive visual fidelity, modern video generative models frequently
produce sequences that violate intuitive physical laws, such as objects
floating, teleporting, or morphing in ways that defy causality. While humans
can easily detect such implausibilities, there remains no robust method for
quantitatively assessing physical realism in video. In this work, we explore
whether Video-Language Models (VLMs) can be trained to serve as reliable judges
of physical plausibility. We find that existing VLMs struggle to identify
physics violations, exposing fundamental limitations in their temporal and
causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe
that combines a balanced training dataset with a trajectory-aware attention
module to improve motion encoding and discrimination in VLMs. To evaluate
physical reasoning more rigorously, we propose ImplausiBench, a benchmark of
300 videos (150 real, 150 generated) that removes linguistic biases and
isolates visual-temporal understanding. Performance is reported both with
gold-standard human judgments and stricter LLM-as-judge metrics. Together,
TRAVL and ImplausiBench offer a unified framework for probing and improving
physical plausibility in multimodal models, shedding light on a challenging and
underexplored aspect of visual-temporal understanding.
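The abstract does not detail how TRAVL's trajectory-aware attention module is built. As a purely illustrative sketch (not the paper's implementation), one way such a module could work is to bias temporal self-attention scores using tracked object motion, so that frames with inconsistent velocities — a cue for floating or teleporting objects — attend to each other differently. The function name `trajectory_aware_attention`, the velocity-difference bias, and the single-object centroid input are all assumptions made for this example:

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def trajectory_aware_attention(frame_feats, positions, scale=1.0):
    """Hypothetical sketch of trajectory-biased temporal attention.

    frame_feats: (T, D) per-frame visual features.
    positions:   (T, 2) tracked object centroid per frame (assumed input).
    scale:       strength of the trajectory bias.
    """
    T, D = frame_feats.shape
    # Standard scaled dot-product attention scores between frames.
    scores = frame_feats @ frame_feats.T / np.sqrt(D)
    # Per-frame velocity of the tracked object (first frame padded with itself,
    # so its velocity is zero).
    vel = np.diff(positions, axis=0, prepend=positions[:1])
    # Pairwise velocity discontinuity between frames: large jumps are a cue
    # for physically implausible motion.
    vel_dist = np.linalg.norm(vel[:, None, :] - vel[None, :, :], axis=-1)
    # Subtract the discontinuity as an additive bias before the softmax.
    attn = softmax(scores - scale * vel_dist, axis=-1)
    return attn @ frame_feats
```

Setting `scale=0` recovers plain temporal self-attention, so the bias can be ablated cleanly; the actual TRAVL module is presumably learned end-to-end rather than using a fixed velocity penalty like this.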