Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm
Summary
This paper proposes a cross-modal knowledge distillation framework that works without paired data by aligning feature and label distributions, offering theoretical guarantees and outperforming prior methods on multimodal benchmarks.
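To illustrate what aligning distributions rather than individual samples can look like, here is a minimal sketch. It is not the paper's algorithm: the losses used here (an RBF-kernel maximum mean discrepancy for features and a KL divergence between batch-averaged class predictions for labels) are stand-ins chosen for illustration. The function names, the bandwidth value, and the assumption that teacher and student features have already been projected to a shared dimension are all hypothetical.

```python
import numpy as np


def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x (n, d) and y (m, d).

    Illustrative stand-in for a feature-alignment term; x and y are
    unpaired batches and may differ in size.
    """
    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped against round-off.
        sq = (np.sum(a ** 2, axis=1)[:, None]
              + np.sum(b ** 2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def marginal_label_kl(teacher_logits, student_logits, eps=1e-12):
    """KL(teacher || student) between batch-averaged class distributions.

    Illustrative stand-in for a label-alignment term: it compares the
    predicted label marginals of two batches, not per-sample predictions.
    """
    p = softmax(teacher_logits).mean(axis=0)
    q = softmax(student_logits).mean(axis=0)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


# Synthetic, unpaired batches (assumed already mapped to 8-d features).
rng = np.random.default_rng(0)
teacher_feats = rng.normal(0.0, 1.0, size=(64, 8))
student_feats_far = rng.normal(3.0, 1.0, size=(48, 8))   # shifted distribution
student_feats_near = rng.normal(0.0, 1.0, size=(48, 8))  # matching distribution

mmd_far = rbf_mmd2(teacher_feats, student_feats_far, bandwidth=4.0)
mmd_near = rbf_mmd2(teacher_feats, student_feats_near, bandwidth=4.0)

teacher_logits = rng.normal(size=(64, 5))
student_logits = rng.normal(size=(48, 5))
label_gap = marginal_label_kl(teacher_logits, student_logits)
```

In this sketch neither term needs a one-to-one correspondence between teacher and student samples, which is why the two batches can have different sizes. A training loop would minimize a weighted sum of such terms alongside the student's task loss. How the paper actually combines its alignment quantities is not stated on this page.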
# Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

Source: [https://arxiv.org/abs/2606.10504](https://arxiv.org/abs/2606.10504) [View PDF](https://arxiv.org/pdf/2606.10504)

> Abstract: Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model building on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.

## Submission history

From: Khiem Tran Trong

**[v1]** Tue, 9 Jun 2026 07:29:12 UTC (1,693 KB)
Similar Articles
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
Switch-KD proposes a visual-switch knowledge distillation framework for efficiently compressing vision-language models by unifying multimodal knowledge transfer within a shared text-probability space. The method achieves a 3.6-point average improvement across 10 multimodal benchmarks when distilling a 0.5B TinyLLaVA student from a 3B teacher model.
OPRD: On-Policy Representation Distillation
OPRD proposes a new knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation
X-Token introduces two loss formulations (P-KL and H-KL) to address failure modes in logit-based cross-tokenizer knowledge distillation, enabling a student model to learn from teachers with incompatible vocabularies and achieving state-of-the-art results on Llama-3.2-1B.
Towards Robust Federated Multimodal Graph Learning under Modality Heterogeneity
This paper proposes FedMPO, a robust federated multimodal graph learning method that addresses modality heterogeneity and missing modalities through topology-aware cross-modal generation, missing-aware expert routing, and reliability-aware aggregation, achieving performance gains on multiple datasets.
On-Policy Distillation (5 minute read)
This paper introduces on-policy distillation, which trains a student model on its own trajectories with teacher token-level KL supervision to fix train-inference mismatch, unifying forward-KL, reverse-KL, and JSD losses, with reverse-KL favored for smaller students.