EMMA: Extracting Multiple Physical Parameters from Multimodal Data

Abstract

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Prior video-only approaches struggle with occluded states and hidden actuation inputs, or assume known initial conditions and coordinate frames. EMMA instead jointly infers explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. It uses a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities, while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline aligns video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without segmentation masks, differentiable rendering, or specialized sensors. Across more than 100 scenarios, including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026
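
To make the idea of a physics-constrained loss concrete, the following is a minimal sketch and not EMMA's implementation. It assumes a trajectory has already been extracted from some modality, uses a damped oscillator x'' + c x' + k x = 0 as a stand-in governing equation, and replaces the LTC network with plain finite differences. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_damped_oscillator(c, k, x0=1.0, v0=0.0, dt=1e-3, steps=5000):
    """Integrate x'' + c x' + k x = 0 with RK4 (hypothetical test system)."""
    def rhs(state):
        x, v = state
        return np.array([v, -c * v - k * x])

    state = np.array([x0, v0], dtype=float)
    xs = np.empty(steps)
    for i in range(steps):
        xs[i] = state[0]
        k1 = rhs(state)
        k2 = rhs(state + 0.5 * dt * k1)
        k3 = rhs(state + 0.5 * dt * k2)
        k4 = rhs(state + dt * k3)
        state = state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return xs


def _derivatives(x, dt):
    """Finite-difference velocity and acceleration of a sampled trajectory."""
    v = np.gradient(x, dt)
    a = np.gradient(v, dt)
    return v, a


def physics_residual_loss(params, x, dt):
    """Mean squared residual of the assumed ODE on the observed trajectory."""
    c, k = params
    v, a = _derivatives(x, dt)
    residual = a + c * v + k * x
    # Drop two samples at each end, where one-sided differences are less accurate.
    return float(np.mean(residual[2:-2] ** 2))


def fit_parameters(x, dt):
    """Minimize the residual in closed form: it is linear in (c, k)."""
    v, a = _derivatives(x, dt)
    design = np.column_stack([v, x])[2:-2]
    target = -a[2:-2]
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution  # estimated (c, k)


# Usage: recover the parameters of a synthetic trajectory.
dt = 1e-3
x_obs = simulate_damped_oscillator(c=0.5, k=4.0, dt=dt)
c_hat, k_hat = fit_parameters(x_obs, dt)
print(f"estimated c = {c_hat:.3f}, k = {k_hat:.3f}")
```

Because the residual is linear in the unknown parameters here, ordinary least squares suffices. For nonlinear or partially observed dynamics, the same residual would typically be minimized by gradient descent alongside a learned latent model.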