Information-Theoretic Decomposition for Multimodal Interaction Learning

arXiv cs.LG Papers

Summary

This paper presents an information-theoretic analysis of multimodal learning that shows why models need to capture sample-specific interactions. It proposes DMIL, a paradigm that isolates these interactions with a variational decomposition architecture and then learns from them during fine-tuning. The authors report consistently superior performance across diverse tasks and architectures.

arXiv:2606.11614v1. Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at https://github.com/GeWu-Lab/DMIL.

Cached at: 06/11/26, 01:49 PM

# Information-Theoretic Decomposition for Multimodal Interaction Learning
Source: [https://arxiv.org/abs/2606.11614](https://arxiv.org/abs/2606.11614)
[View PDF](https://arxiv.org/pdf/2606.11614)

> Abstract: Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at [https://github.com/GeWu-Lab/DMIL](https://github.com/GeWu-Lab/DMIL).
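
The distinction the abstract draws between redundant and synergistic information can be made concrete with two toy distributions. The sketch below is illustrative only and is not taken from the paper or the DMIL code: the distributions, the function names `mutual_information` and `summarize`, and the choice of plain mutual information are all assumptions made here. It does not compute a full decomposition into redundant, unique, and synergistic parts, which requires choosing a redundancy measure. It only compares what each modality says about the label alone with what both say together.

```python
import math
from collections import defaultdict
from itertools import product


def mutual_information(joint, a_idx, b_idx):
    """Return I(A;B) in bits.

    `joint` maps outcome tuples to probabilities; `a_idx` and `b_idx`
    select which coordinates of each outcome form A and B.
    """
    p_ab, p_a, p_b = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        a = tuple(outcome[i] for i in a_idx)
        b = tuple(outcome[i] for i in b_idx)
        p_ab[(a, b)] += p
        p_a[a] += p
        p_b[b] += p
    return sum(
        p * math.log2(p / (p_a[a] * p_b[b]))
        for (a, b), p in p_ab.items()
        if p > 0
    )


def summarize(joint):
    """Outcomes are (x1, x2, y): two toy 'modalities' and a label."""
    return {
        "I(X1;Y)": mutual_information(joint, [0], [2]),
        "I(X2;Y)": mutual_information(joint, [1], [2]),
        "I(X1,X2;Y)": mutual_information(joint, [0, 1], [2]),
    }


# Hypothetical synergistic case: Y = X1 XOR X2 with independent fair bits.
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}

# Hypothetical redundant case: both modalities are exact copies of a fair-bit label.
copy_joint = {(y, y, y): 0.5 for y in (0, 1)}

if __name__ == "__main__":
    print("XOR :", summarize(xor_joint))
    print("copy:", summarize(copy_joint))
```

In the XOR case each modality alone carries 0 bits about the label while the pair carries 1 bit, so combining separately trained unimodal predictors cannot recover the label. In the copy case either modality alone already carries the full 1 bit. Real samples presumably fall between these extremes, which is consistent with the abstract's argument for adapting to interactions on a per-sample basis.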

## Submission history

From: Zequn Yang **[v1]** Wed, 10 Jun 2026 03:27:13 UTC (3,575 KB)
