EMMA: Extracting Multiple physical parameters from Multimodal Data
Summary
EMMA is a physics-informed multimodal framework that recovers dynamical parameters from raw video, audio, and image data using a Liquid Time-Constant network and physics-constrained loss, outperforming existing baselines across diverse benchmarks.
Cached at: 06/09/26, 08:41 AM
Source: https://huggingface.co/papers/2605.24047
Abstract
We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026
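The abstract names two ingredients, an LTC network and a physics-constrained loss, without giving formulas. The sketch below is not taken from the EMMA code. It illustrates the general idea under stated assumptions: the cell update is the fused semi-implicit Euler step from the original LTC formulation, and the loss penalizes the residual of a damped oscillator x'' + c x' + k x = 0, chosen here only as a stand-in governing equation. All function names, the sigmoid gate, and the parameters `k` and `c` are hypothetical.

```python
import math


def ltc_step(x, inp, tau, w_in, w_rec, bias, A, dt):
    """One fused (semi-implicit) Euler step of an LTC cell.

    Assumed dynamics: dx/dt = -(1/tau + f) * x + f * A,
    where f is a bounded gate that depends on the input and the state.
    """
    n = len(x)
    new_x = []
    for i in range(n):
        # Gate f: a sigmoid of the input and recurrent state (an assumption).
        pre = bias[i] + w_in[i] * inp + sum(w_rec[i][j] * x[j] for j in range(n))
        f = 1.0 / (1.0 + math.exp(-pre))
        # Fused update: x <- (x + dt * f * A) / (1 + dt * (1/tau + f))
        new_x.append((x[i] + dt * f * A[i]) / (1.0 + dt * (1.0 / tau[i] + f)))
    return new_x


def physics_residual_loss(traj, dt, k, c):
    """Mean squared residual of x'' + c x' + k x = 0 on a sampled trajectory.

    Derivatives are approximated with central differences on interior points.
    """
    total = 0.0
    count = 0
    for t in range(1, len(traj) - 1):
        vel = (traj[t + 1] - traj[t - 1]) / (2.0 * dt)
        acc = (traj[t + 1] - 2.0 * traj[t] + traj[t - 1]) / (dt * dt)
        residual = acc + c * vel + k * traj[t]
        total += residual * residual
        count += 1
    return total / count
```

In a setup like this, candidate parameters such as `k` and `c` would be adjusted to drive the residual toward zero alongside a data-fitting term. How EMMA combines the two terms is not stated in the abstract.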
Similar Articles
EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
This article introduces EmoS, a high-fidelity multimodal benchmark designed for fine-grained streaming emotional understanding, addressing limitations in ecological validity and labeling reliability found in existing datasets.
Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals
This paper evaluates deep learning models (LSTM, TCN, Transformer) on the WESAD dataset for multimodal emotion recognition from physiological signals, showing that an ensemble achieves 98.91% accuracy.
LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts
LongMoE proposes a unified framework that jointly addresses modality missingness and longitudinal dynamics in multimodal clinical learning, using context-aware imputation, attentional tokenization, trajectory-aware encoding, and sparse mixture-of-experts routing. Experiments on ADNI, OASIS-3, and MIMIC-IV demonstrate improved robustness under missing modalities while remaining competitive in full-modality settings.
MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding
This paper introduces MultiSeismo, a large-scale multimodal seismic dataset with over 16K events integrating waveforms, intensity maps, and metadata, along with the MISCE instruction set and SeisModal, a fine-tuned multimodal model for cross-modal seismic understanding.
Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
This paper investigates the arithmetic limitations of multimodal LLMs on multi-digit multiplication across text, image, and audio modalities, introducing a controlled benchmark and a novel 'arithmetic load' metric (C) that better predicts model accuracy than traditional step-counting methods. Results show accuracy collapses as C grows, and that performance degradation is primarily computational rather than perceptual.