Information-Theoretic Decomposition for Multimodal Interaction Learning
Summary
This paper presents an information-theoretic analysis of multimodal learning that shows why sample-specific interactions (redundant, unique, and synergistic information) must be captured. It proposes DMIL, a paradigm that isolates these interactions with a variational decomposition architecture and then learns from them during fine-tuning. The authors report consistently superior performance across diverse tasks and architectures.
# Information-Theoretic Decomposition for Multimodal Interaction Learning

Source: [https://arxiv.org/abs/2606.11614](https://arxiv.org/abs/2606.11614) [View PDF](https://arxiv.org/pdf/2606.11614)

> Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at [this https URL](https://github.com/GeWu-Lab/DMIL).
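To make the three interaction types named in the abstract concrete, here is a minimal sketch that is not taken from the paper or its code. It assumes the standard partial information decomposition view, in which I(X1,X2;Y) = R + U1 + U2 + S and I(Xi;Y) = R + Ui with non-negative terms; the paper's own formulation may differ. The helper `mutual_information` and the two toy tasks are hypothetical names introduced only for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def mutual_information(joint, source_idx, target_idx):
    """Return I(S;T) in bits for a joint distribution.

    `joint` maps outcome tuples to probabilities; `source_idx` and
    `target_idx` select which tuple positions form S and T.
    """
    p_st = defaultdict(float)
    p_s = defaultdict(float)
    p_t = defaultdict(float)
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in source_idx)
        t = tuple(outcome[i] for i in target_idx)
        p_st[(s, t)] += p
        p_s[s] += p
        p_t[t] += p
    return sum(
        p * math.log2(p / (p_s[s] * p_t[t]))
        for (s, t), p in p_st.items()
        if p > 0
    )


def summarize(joint):
    """Mutual information of each modality alone and of both together."""
    return {
        "I(X1;Y)": mutual_information(joint, [0], [2]),
        "I(X2;Y)": mutual_information(joint, [1], [2]),
        "I(X1,X2;Y)": mutual_information(joint, [0, 1], [2]),
    }


# Toy tasks (illustrative assumptions); outcomes are (x1, x2, y).
# XOR: the label is recoverable only from both modalities together.
xor_task = {(a, b, a ^ b): 0.25 for a, b in product((0, 1), repeat=2)}
# Copy: both modalities carry the same label information.
copy_task = {(a, a, a): 0.5 for a in (0, 1)}

xor_stats = summarize(xor_task)
copy_stats = summarize(copy_task)
print("XOR :", xor_stats)
print("Copy:", copy_stats)
```

In the XOR task each modality alone carries 0 bits about the label while the pair carries 1 bit, so under the assumed decomposition all of the information is synergistic. In the copy task each modality alone already carries the full 1 bit, so all of it is redundant. A dataset that mixes samples of both kinds is one simple picture of interactions that vary from sample to sample.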
## Submission history

From: Zequn Yang [[view email](https://arxiv.org/show-email/cb896b1a/2606.11614)]

**[v1]** Wed, 10 Jun 2026 03:27:13 UTC (3,575 KB)
Similar Articles
Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs
This paper identifies and addresses the 'editing decoupling failure' in Multimodal LLMs, where knowledge updates via multimodal inputs fail to generalize to unimodal queries. The authors propose DECODE, a method to disentangle and localize modality-specific neurons for more effective knowledge editing.
SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning
The article introduces SynIB, a scalable objective based on the information bottleneck principle that targets synergistic information in multimodal learning by penalizing confident predictions when a modality is masked, improving performance on tasks requiring cross-modal reasoning.
CL-DMDF: Dynamic Multimodal Data Fusion Model Based on Contrastive Learning
This paper proposes CL-DMDF, a dynamic multimodal data fusion model that uses contrastive learning and a dual-dimensional attention mechanism to handle missing modalities and improve discriminative learning.
Learning to Learn from Multimodal Experience
This paper introduces AutoMMemo, a framework that enables multimodal agents to automatically design memory mechanisms (expressible as executable memo programs) for learning from multimodal interaction trajectories, outperforming no-memory and fixed-memory baselines on GUI/Web navigation and visual reasoning benchmarks.
ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection
ThinkDeception proposes a novel framework that leverages multimodal large language models and a progressive reinforcement learning strategy with chain-of-thought reasoning for interpretable deception detection, achieving new state-of-the-art results on standard benchmarks.