Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

Abstract

Automated discovery of physical laws from observational data in the real world is a grand challenge in AI. Current methods, relying on symbolic regression or LLMs, are limited to uni-modal data and overlook the rich, visual phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the inherent spatio-temporal patterns within dynamic phenomena. To address this gap, we propose VIPER-R1, a multimodal model that performs Visual Induction for Physics-based Equation Reasoning to discover fundamental symbolic formulas. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. The model is trained via a curriculum of Motion Structure Induction (MSI), using supervised fine-tuning to interpret kinematic phase portraits and to construct hypotheses guided by a Causal Chain of Thought (C-CoT), followed by Reward-Guided Symbolic Calibration (RGSC) to refine the formula structure with reinforcement learning. During inference, the trained VIPER-R1 acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR^2). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with empirical data. To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art VLM baselines in accuracy and interpretability, enabling more precise discovery of physical laws. Project page: https://jiaaqiliu.github.io/VIPER-R1/
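The residual-fitting idea behind SR^2 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the observations are synthetic and come from an invented one-dimensional system rather than from PhysSymbol, the hard-coded `ansatz` function stands in for the hypothesis the trained model would propose, and a thresholded least-squares fit over a small hand-picked term library (`LIBRARY`, `fit_residual`) stands in for the external symbolic regression tool, whose actual interface the abstract does not describe.

```python
import numpy as np

# Synthetic observations of an invented 1-D system (not drawn from PhysSymbol):
# a = -4.0*x - 0.5*v - 0.3*x**3, plus small measurement noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, 500)
v = rng.uniform(-2.0, 2.0, 500)
a = -4.0 * x - 0.5 * v - 0.3 * x**3 + rng.normal(0.0, 0.01, 500)


def ansatz(x, v):
    """Hypothetical model-proposed law: a damped linear oscillator."""
    return -4.0 * x - 0.5 * v


# Candidate correction terms. A real symbolic regression tool would search
# a far larger expression space than this hand-picked library.
LIBRARY = {
    "x**2": lambda x, v: x**2,
    "x**3": lambda x, v: x**3,
    "v**2": lambda x, v: v**2,
    "x*v": lambda x, v: x * v,
}


def fit_residual(x, v, a, ansatz, library, threshold=0.05):
    """Fit what the ansatz fails to explain with a sparse sum of library terms."""
    residual = a - ansatz(x, v)
    names = list(library)
    features = np.column_stack([library[name](x, v) for name in names])

    # First pass: ordinary least squares over every candidate term.
    coeffs, *_ = np.linalg.lstsq(features, residual, rcond=None)

    # Keep only terms with non-negligible coefficients, then refit on them.
    kept = [i for i, c in enumerate(coeffs) if abs(c) > threshold]
    if not kept:
        return {}
    refit, *_ = np.linalg.lstsq(features[:, kept], residual, rcond=None)
    return {names[i]: float(c) for i, c in zip(kept, refit)}


correction = fit_residual(x, v, a, ansatz, LIBRARY)
print(correction)
```

On this synthetic data the fit should keep a single `x**3` term with a coefficient close to -0.3, which is the cubic correction the linear ansatz omitted. The point of the sketch is the division of labor: the ansatz fixes the dominant structure, and the regression step only has to explain the much smaller residual.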
