Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models
September 10, 2025
Authors: Pranav Pawar, Kavish Shah, Akshat Bhalani, Komal Kasat, Dev Mittal, Hadi Gala, Deepali Patil, Nikita Raichada, Monali Deshmukh
cs.AI
Abstract
As Vision-Language Models (VLMs) grow in sophistication, their ability to
perform reasoning is coming under increasing scrutiny. While they excel at
many tasks, their grasp of fundamental scientific principles, such as physics,
remains an underexplored frontier. To probe the advancement of these
capabilities, we introduce a novel and accessible framework designed to
rigorously evaluate VLMs on their understanding of 2D physics. Our framework
features a pragmatic scenario generator that creates a diverse testbed of over
400 problems across four core domains: Projectile Motion, Collision Dynamics,
Mechanics, and Fluid Dynamics. Through a comprehensive evaluation of four
state-of-the-art VLMs, we demonstrate a strong correlation between model scale
and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving
an overall score of 0.815. We find that while models excel at formulaic
problems, they struggle significantly with domains requiring abstract spatial
reasoning. By designing this framework, we aim to democratize the study of
scientific reasoning in VLMs and foster deeper insights into their capabilities
and limitations.
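To illustrate the kind of scenario generator the abstract describes, here is a minimal sketch of one domain (Projectile Motion) in Python. The four domain names come from the paper; every function name, parameter range, and the problem format are assumptions for illustration, not the authors' actual implementation.

```python
import math
import random

# The four domains named in the paper; only the first is sketched below.
DOMAINS = ["Projectile Motion", "Collision Dynamics", "Mechanics", "Fluid Dynamics"]
G = 9.81  # gravitational acceleration, m/s^2

def projectile_problem(rng):
    """Generate one projectile-motion question with its ground-truth answer."""
    v0 = rng.uniform(5.0, 30.0)      # launch speed, m/s (assumed range)
    angle = rng.uniform(15.0, 75.0)  # launch angle, degrees (assumed range)
    vy = v0 * math.sin(math.radians(angle))
    t_flight = 2.0 * vy / G          # time of flight on flat ground
    x_range = v0 * math.cos(math.radians(angle)) * t_flight
    question = (f"A projectile is launched at {v0:.1f} m/s at an angle of "
                f"{angle:.1f} degrees. How far does it travel horizontally "
                f"(g = 9.81 m/s^2)?")
    return {"domain": DOMAINS[0], "question": question,
            "answer": round(x_range, 2)}

def generate_testbed(n=400, seed=0):
    """Build a seeded testbed of n problems, so evaluations are reproducible."""
    rng = random.Random(seed)
    return [projectile_problem(rng) for _ in range(n)]

problems = generate_testbed()
```

Because each problem carries a closed-form ground-truth answer, a VLM's response (optionally paired with a rendered image of the scene) can be scored automatically, which is what makes a 400-problem testbed practical to evaluate.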