Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

Abstract

Automated discovery of physical laws from real-world observational data is a grand challenge in AI. Current methods, which rely on symbolic regression or large language models (LLMs), are limited to uni-modal data and overlook the rich visual, phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the spatio-temporal patterns inherent in dynamic phenomena. To address this gap, we propose VIPER-R1, a multimodal model that performs Visual Induction for Physics-based Equation Reasoning to discover fundamental symbolic formulas. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. The model is trained in two stages. First, a curriculum of Motion Structure Induction (MSI) uses supervised fine-tuning to teach the model to interpret kinematic phase portraits and to construct hypotheses guided by a Causal Chain of Thought (C-CoT). Second, Reward-Guided Symbolic Calibration (RGSC) refines the formula structure with reinforcement learning. During inference, the trained VIPER-R1 acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR^2). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with empirical data. To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art vision-language model (VLM) baselines in accuracy and interpretability, enabling more precise discovery of physical laws. Project page: https://jiaaqiliu.github.io/VIPER-R1/
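The residual-fitting idea behind the final inference step can be illustrated with a toy sketch. This is not the VIPER-R1 implementation: the ansatz `a = -k*x`, the synthetic law `a = -4x - 0.5v`, the two-term candidate library, and the function names are all assumptions made for illustration, and a plain least-squares fit stands in for the external symbolic regression tool mentioned in the abstract.

```python
import math

def ansatz_accel(x, k=4.0):
    """Hypothetical high-confidence ansatz: undamped spring, a = -k*x."""
    return -k * x

# Hypothetical candidate terms for the residual.
# A real symbolic regression tool would search a far larger expression space.
CANDIDATES = {
    "v": lambda x, v: v,
    "x**3": lambda x, v: x ** 3,
}

def fit_residual(xs, vs, accels):
    """Least-squares fit of (observed - ansatz) onto the two candidate terms."""
    names = list(CANDIDATES)
    rows = [[CANDIDATES[n](x, v) for n in names] for x, v in zip(xs, vs)]
    resid = [a - ansatz_accel(x) for x, a in zip(xs, accels)]
    # Normal equations for two terms, solved in closed form (Cramer's rule).
    s11 = sum(r[0] * r[0] for r in rows)
    s12 = sum(r[0] * r[1] for r in rows)
    s22 = sum(r[1] * r[1] for r in rows)
    b1 = sum(r[0] * y for r, y in zip(rows, resid))
    b2 = sum(r[1] * y for r, y in zip(rows, resid))
    det = s11 * s22 - s12 * s12
    c1 = (b1 * s22 - b2 * s12) / det
    c2 = (s11 * b2 - s12 * b1) / det
    return dict(zip(names, (c1, c2)))

# Synthetic state samples (not an integrated trajectory); the "true" law
# a = -4x - 0.5v is an assumption chosen so the ansatz misses a damping term.
ts = [0.05 * i for i in range(200)]
xs = [math.cos(2.0 * t) for t in ts]
vs = [-2.0 * math.sin(2.0 * t) for t in ts]
accels = [-4.0 * x - 0.5 * v for x, v in zip(xs, vs)]

correction = fit_residual(xs, vs, accels)
print(correction)  # coefficients of the correction terms added to the ansatz
```

In this sketch the ansatz captures the dominant structure, and the residual fit only has to recover the small term the ansatz missed, which is the sense in which the step resembles a perturbative correction.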
