Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

Abstract

Automated discovery of physical laws from real-world observational data is a grand challenge in AI. Current methods, which rely on symbolic regression or large language models (LLMs), are limited to unimodal data and overlook the rich visual, phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the spatio-temporal patterns inherent in dynamic phenomena. To address this gap, we propose VIPER-R1, a multimodal model that performs Visual Induction for Physics-based Equation Reasoning to discover fundamental symbolic formulas. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. The model is trained in two stages. First, a Motion Structure Induction (MSI) curriculum uses supervised fine-tuning to teach the model to interpret kinematic phase portraits and to construct hypotheses guided by a Causal Chain of Thought (C-CoT). Second, Reward-Guided Symbolic Calibration (RGSC) refines the formula structure with reinforcement learning. During inference, the trained VIPER-R1 acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR^2). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with the empirical data. To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art vision-language model (VLM) baselines in accuracy and interpretability, enabling more precise discovery of physical laws. Project page: https://jiaaqiliu.github.io/VIPER-R1/
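
The ansatz-then-residual idea behind SR^2 can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says only that an external symbolic regression tool is invoked, so the sketch below substitutes a thresholded least-squares fit over a small hand-picked term library as a stand-in for that tool. The function name fit_residual, the term library, the threshold, and the synthetic damped-oscillator data are all assumptions made for illustration.

```python
import numpy as np


def fit_residual(x, v, a_obs, ansatz, threshold=0.05):
    """Fit whatever the ansatz leaves unexplained.

    Stand-in for an external symbolic regression tool (assumption):
    ordinary least squares over a tiny, hypothetical term library,
    keeping only terms whose coefficient magnitude exceeds `threshold`.
    """
    # Residual between observed acceleration and the ansatz prediction.
    residual = a_obs - ansatz(x, v)

    # Hypothetical candidate terms; a real tool searches a far larger space.
    names = ["x", "v", "x**3", "v*abs(v)"]
    library = np.column_stack([x, v, x**3, v * np.abs(v)])

    coeffs, *_ = np.linalg.lstsq(library, residual, rcond=None)
    return {n: float(c) for n, c in zip(names, coeffs) if abs(c) > threshold}


# Synthetic, noise-free data from a damped oscillator: a = -4x - 0.3v.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 500)
v = rng.uniform(-1.0, 1.0, 500)
a_obs = -4.0 * x - 0.3 * v


# Suppose the model's high-confidence ansatz captures only the spring term.
def ansatz(x, v):
    return -4.0 * x


# The residual fit should recover the missing damping term in v.
correction = fit_residual(x, v, a_obs, ansatz)
print(correction)
```

In this toy setting the ansatz supplies the dominant structure and the residual fit only has to explain the small remaining term, which is the sense in which the step resembles a perturbative correction.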
