Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

arXiv cs.AI 06/10/26, 04:00 AM Papers

Summary

This paper proposes a cross-modal knowledge distillation framework that works without paired data by aligning feature and label distributions, offering theoretical guarantees and outperforming prior methods on multimodal benchmarks.

arXiv:2606.10504v1 Announce Type: new Abstract: Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model built on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.
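The abstract does not say which divergences or losses the framework uses, so the following is only a minimal sketch of what "aligning distributions rather than individual samples" could look like on unpaired batches. It is not the authors' algorithm. Everything in it is an assumption made for illustration: the squared maximum mean discrepancy (MMD) with an RBF kernel as the feature-alignment term, a KL divergence between batch-averaged class predictions as the label-alignment term, the requirement that teacher and student features share one dimension (for example after projection heads) and one label space, and all function and parameter names.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    xs and ys are compared as sets: they need not be paired or equally sized.
    """
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def mean_distribution(probs):
    """Average a batch of per-sample class-probability vectors."""
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def unpaired_alignment_loss(teacher_feats, student_feats,
                            teacher_probs, student_probs, label_weight=1.0):
    """Hypothetical objective: feature term (MMD) plus weighted label term (KL).

    Assumption: both terms compare batch-level distributions only, so no
    teacher sample has to correspond to any particular student sample.
    """
    feature_term = mmd_squared(teacher_feats, student_feats)
    label_term = kl_divergence(mean_distribution(teacher_probs),
                               mean_distribution(student_probs))
    return feature_term + label_weight * label_term

if __name__ == "__main__":
    # Toy batches of different sizes; the values are made up for illustration.
    teacher_feats = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
    student_feats = [[0.1, 0.9], [0.0, 1.1]]
    teacher_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    student_probs = [[0.7, 0.3], [0.4, 0.6]]
    print(unpaired_alignment_loss(teacher_feats, student_feats,
                                  teacher_probs, student_probs))
```

Because both terms depend only on batch statistics, the teacher and student batches can be drawn independently from their own modalities. In a real training loop the same quantities would be computed with an automatic-differentiation library so that the student can be updated by gradient descent.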

Source: [https://arxiv.org/abs/2606.10504](https://arxiv.org/abs/2606.10504)
[View PDF](https://arxiv.org/pdf/2606.10504)

## Submission history

From: Khiem Tran Trong. **[v1]** Tue, 9 Jun 2026 07:29:12 UTC (1,693 KB)
