

QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

December 22, 2025
Authors: Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-Fei, Ehsan Adeli
cs.AI

Abstract

Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason quantitatively about physical properties. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors such as background noise, counterfactual priors, and strategic prompting, and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning quantitatively about kinematic properties. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
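The abstract describes scoring a model's numerical estimates of kinematic quantities against ground truth. The paper's exact metric is not given here, so the following is only a minimal sketch of one plausible scheme: relative error per quantity, with an instance scored by the fraction of quantities falling within a tolerance. The function names, tolerance value, and dictionary format are all assumptions for illustration, not the benchmark's actual protocol.

```python
def relative_error(pred: float, truth: float) -> float:
    """Absolute relative error |pred - truth| / |truth| (truth assumed nonzero)."""
    return abs(pred - truth) / abs(truth)


def score_instance(predictions: dict, ground_truth: dict, tolerance: float = 0.10) -> float:
    """Hypothetical per-instance score: fraction of kinematic quantities
    (e.g. size, velocity, acceleration) whose relative error is within `tolerance`.
    Quantities missing from `predictions` count as misses."""
    keys = list(ground_truth)
    hits = sum(
        1
        for k in keys
        if k in predictions
        and relative_error(predictions[k], ground_truth[k]) <= tolerance
    )
    return hits / len(keys)


# Example: both estimates are within 10% relative error, so the score is 1.0.
score = score_instance(
    {"velocity": 3.1, "acceleration": 0.9},
    {"velocity": 3.0, "acceleration": 1.0},
)
```

A threshold-based score like this rewards numerically grounded answers rather than merely plausible-sounding ones, which is the gap the benchmark is designed to expose.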