

Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models

September 10, 2025
作者: Pranav Pawar, Kavish Shah, Akshat Bhalani, Komal Kasat, Dev Mittal, Hadi Gala, Deepali Patil, Nikita Raichada, Monali Deshmukh
cs.AI

Abstract

As Vision-Language Models (VLMs) grow in sophistication, their ability to perform reasoning is coming under increasing scrutiny. While they excel at many tasks, their grasp of fundamental scientific principles, such as physics, remains an underexplored frontier. To reflect the advancements in these capabilities, we introduce a novel and accessible framework designed to rigorously evaluate VLMs on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. Through comprehensive evaluation of four state-of-the-art VLMs, we demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815. We find that while models excel at formulaic problems, they struggle significantly with domains requiring abstract spatial reasoning. By designing this framework, we aim to democratize the study of scientific reasoning in VLMs and foster deeper insights into their capabilities and limitations.