EMMA: Extracting Multiple Physical Parameters from Multimodal Data

Abstract

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Prior video-only approaches struggle with occluded states and hidden actuation inputs, or rely on assumptions of known initial conditions and coordinate frames. EMMA instead jointly infers explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. It uses a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities, while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline aligns video trajectories, acoustic signatures, and chart-derived measurements, so EMMA can estimate parameters under forced, implicit, and multivariate dynamics without segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios, including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. These results establish EMMA as a general, scalable solution for extracting physics-consistent models from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026
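
To make the idea of a physics-constrained loss concrete, the following is a minimal sketch and not EMMA's implementation. It assumes a damped linear oscillator x'' + c x' + k x = 0 as the governing equation, uses a synthetic trajectory as a stand-in for the latent trajectory a network would produce from video or audio, and replaces gradient-based training with a plain least-squares fit. The names physics_residual_loss and fit_parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def _derivatives(t, x):
    """First and second time derivatives by central finite differences."""
    dx = np.gradient(x, t)
    ddx = np.gradient(dx, t)
    return dx, ddx


def physics_residual_loss(t, x, c, k):
    """Mean squared residual of the assumed ODE x'' + c*x' + k*x = 0."""
    dx, ddx = _derivatives(t, x)
    residual = ddx + c * dx + k * x
    # Drop two samples at each end, where the differences are one-sided.
    return float(np.mean(residual[2:-2] ** 2))


def fit_parameters(t, x):
    """Estimate (c, k) by minimizing the residual above.

    The residual is linear in (c, k), so the minimizer solves the
    least-squares problem [x', x] @ [c, k] = -x''.
    """
    dx, ddx = _derivatives(t, x)
    A = np.column_stack([dx, x])[2:-2]
    b = -ddx[2:-2]
    (c, k), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(c), float(k)


# Synthetic underdamped trajectory (assumed ground truth, illustration only).
c_true, k_true = 0.4, 9.0
t = np.linspace(0.0, 10.0, 2001)
omega = np.sqrt(k_true - (c_true / 2.0) ** 2)
x = np.exp(-c_true * t / 2.0) * np.cos(omega * t)

c_est, k_est = fit_parameters(t, x)
loss_at_fit = physics_residual_loss(t, x, c_est, k_est)
loss_at_wrong = physics_residual_loss(t, x, 1.0, 4.0)
```

In this toy setting the residual is far smaller at the fitted parameters than at a mismatched pair, which is the property a physics-constrained loss exploits. In a learned model, the same residual would typically be added to a data-fitting term and minimized jointly over network weights and physical parameters.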