QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
December 22, 2025
Authors: Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli
cs.AI
Abstract
Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art visual perception models (e.g., large VLMs) can reason quantitatively about physical properties. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates how well a VLM estimates an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors such as background noise, counterfactual priors, and strategic prompting, and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully grounding their answers in the provided visual and textual inputs when reasoning quantitatively about kinematic properties. QuantiPhy offers the first rigorous, scalable testbed for moving VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
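The abstract does not spell out the benchmark's scoring rule, only that it checks numerical accuracy under standardized prompts. As an illustration only, one common way such a benchmark might score a numeric answer is a relative-error threshold; the function name and the 10% tolerance below are hypothetical assumptions, not QuantiPhy's actual metric:

```python
def relative_error_score(pred: float, truth: float, tolerance: float = 0.1) -> float:
    """Hypothetical scoring sketch: 1.0 if the predicted kinematic quantity
    (e.g., size, velocity, or acceleration) is within `tolerance` relative
    error of the ground truth, else 0.0. Not the paper's actual metric."""
    if truth == 0.0:
        # Fall back to absolute error when the ground truth is zero.
        return 1.0 if abs(pred) <= tolerance else 0.0
    return 1.0 if abs(pred - truth) / abs(truth) <= tolerance else 0.0


def benchmark_accuracy(predictions: list[float], truths: list[float]) -> float:
    """Mean score over all video-text instances."""
    scores = [relative_error_score(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(scores)
```

A thresholded metric like this rewards numerically grounded answers rather than merely plausible-sounding ones, which matches the gap the paper reports between qualitative plausibility and numerical correctness.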