EMMA: Extracting Multiple Physical Parameters from Multimodal Data

Abstract

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches, which struggle with occluded states or hidden actuation inputs, or which assume known initial conditions and coordinate frames, EMMA jointly infers explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA uses a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities, while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline aligns video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without segmentation masks, differentiable rendering, or specialized sensors. Across more than 100 scenarios, including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. These results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026
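
To make the two named ingredients concrete, the following is a minimal sketch, not EMMA's implementation. It shows (1) one fused semi-implicit Euler step of an LTC cell in the form published for LTC networks, and (2) a physics-constrained residual loss for a toy damped oscillator x'' + c x' + k x = 0. The oscillator, the sigmoid gate, all function and variable names, and the least-squares recovery step are assumptions chosen for illustration. The abstract does not specify how EMMA couples the latent dynamics to the physics loss, so the two pieces are shown independently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ltc_fused_step(x, u, W_rec, W_in, b, tau, A, dt):
    """One fused semi-implicit Euler step of an LTC cell.

    LTC dynamics: dx/dt = -(1/tau + f) * x + f * A.
    The gate f = sigmoid(W_rec x + W_in u + b) is an illustrative choice.
    """
    f = sigmoid(W_rec @ x + W_in @ u + b)
    return (x + dt * f * A) / (1.0 + dt * (1.0 / tau + f))

def oscillator_residual(x, t, c, k):
    """Residual of x'' + c x' + k x = 0, using finite-difference derivatives."""
    v = np.gradient(x, t)   # velocity estimate
    a = np.gradient(v, t)   # acceleration estimate
    # Drop two samples per side, where one-sided differences are less accurate.
    return (a + c * v + k * x)[2:-2]

def physics_loss(x, t, c, k):
    """Mean squared physics residual: small when (c, k) fit the trajectory."""
    return float(np.mean(oscillator_residual(x, t, c, k) ** 2))

# Hypothetical toy trajectory: analytic underdamped solution for known (c, k).
c_true, k_true = 0.4, 9.0
t = np.linspace(0.0, 10.0, 2001)
omega = np.sqrt(k_true - 0.25 * c_true ** 2)
x = np.exp(-0.5 * c_true * t) * np.cos(omega * t)

loss_true = physics_loss(x, t, c_true, k_true)   # near zero
loss_wrong = physics_loss(x, t, 1.0, 4.0)        # clearly larger

# The residual is linear in (c, k) for this toy system, so the loss
# has a closed-form minimizer via linear least squares.
v = np.gradient(x, t)
a = np.gradient(v, t)
Phi = np.stack([v, x], axis=1)[2:-2]
c_hat, k_hat = np.linalg.lstsq(Phi, -a[2:-2], rcond=None)[0]

# Roll the LTC cell forward on random inputs (weights are random placeholders).
rng = np.random.default_rng(0)
n_hidden, n_in = 4, 2
W_rec = rng.normal(size=(n_hidden, n_hidden))
W_in = rng.normal(size=(n_hidden, n_in))
b = np.zeros(n_hidden)
tau = np.ones(n_hidden)
A = rng.uniform(-1.0, 1.0, size=n_hidden)
h = np.zeros(n_hidden)
for _ in range(50):
    h = ltc_fused_step(h, rng.normal(size=n_in), W_rec, W_in, b, tau, A, dt=0.1)
```

In this sketch the physics loss separates correct from incorrect parameters without using any labels, which is the general idea behind a physics-constrained objective. With a sigmoid gate the fused LTC step also keeps each state component within the range set by its initial value and A, so the rollout stays bounded for any input sequence.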