Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

Abstract

Automatically discovering physical laws from real-world observational data is a grand challenge in AI. Current methods, which rely on symbolic regression or large language models (LLMs), are limited to unimodal data and overlook the rich visual, phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the spatio-temporal patterns inherent in dynamic phenomena. To address this gap, we propose VIPER-R1, a multimodal model that performs Visual Induction for Physics-based Equation Reasoning to discover fundamental symbolic formulas. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. The model is trained via a Motion Structure Induction (MSI) curriculum, which uses supervised fine-tuning to interpret kinematic phase portraits and to construct hypotheses guided by a Causal Chain of Thought (C-CoT), followed by Reward-Guided Symbolic Calibration (RGSC), which refines the formula structure with reinforcement learning. During inference, the trained VIPER-R1 acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR^2). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with empirical data. To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art vision-language model (VLM) baselines in accuracy and interpretability, enabling more precise discovery of physical laws. Project page: https://jiaaqiliu.github.io/VIPER-R1/
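The inference-time step described above (posit an ansatz, then fit what it leaves unexplained) can be illustrated with a minimal sketch. This is not the VIPER-R1 implementation: the function `residual_realignment`, the synthetic oscillator, the hand-written ansatz, and the small library of candidate terms are all hypothetical, and an ordinary least-squares fit stands in for the external symbolic regression tool that the abstract mentions.

```python
import numpy as np

def residual_realignment(x, v, a_observed, ansatz, candidate_terms):
    """Fit correction terms to the part of the data the ansatz does not explain.

    Hypothetical helper: least squares over a fixed term library is used here
    as a stand-in for a real symbolic regression tool.
    """
    residual = a_observed - ansatz(x, v)
    # Design matrix with one column per candidate correction term.
    design = np.column_stack([term(x, v) for term in candidate_terms.values()])
    coeffs, *_ = np.linalg.lstsq(design, residual, rcond=None)
    return dict(zip(candidate_terms.keys(), coeffs))

# Synthetic, noise-free data (assumed for illustration):
# a = -4.0*x - 0.3*v - 0.5*x**3
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)
v = rng.uniform(-1.0, 1.0, 500)
a_observed = -4.0 * x - 0.3 * v - 0.5 * x**3

# Hypothetical ansatz: only the dominant linear restoring force.
def ansatz(x, v):
    return -4.0 * x

# Hypothetical library of correction terms to test against the residual.
candidate_terms = {
    "x": lambda x, v: x,
    "v": lambda x, v: v,
    "x**3": lambda x, v: x**3,
    "v**2": lambda x, v: v**2,
}

corrections = residual_realignment(x, v, a_observed, ansatz, candidate_terms)
for name, coeff in corrections.items():
    print(f"{name}: {coeff:+.3f}")
```

Because the data are noise-free, the fit should assign coefficients close to -0.3 for `v` and -0.5 for `x**3`, and close to zero for the other terms. With real measurements, a sparsity-promoting fit or a full symbolic regression search would likely be needed in place of plain least squares.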
