cross-modal

#cross-modal

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

arXiv cs.CL ↗ · 5d ago Cached

This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.
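As a rough illustration of the compression step, the sketch below lets a small set of query vectors attend over all visual tokens, so 64 tokens become 4. It is not the paper's implementation: the single attention head, the absence of learned key/value projections, the random stand-in queries, and names such as `cross_attention_project` are all assumptions made for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_project(queries, visual_tokens):
    """Compress N visual tokens into len(queries) tokens.

    Hypothetical single-head cross-attention with no learned projections:
    each query scores every visual token, and its output is the
    softmax-weighted average of the tokens.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in queries:
        scores = [scale * sum(q * t for q, t in zip(query, token))
                  for token in visual_tokens]
        weights = softmax(scores)
        compressed.append([
            sum(w * token[d] for w, token in zip(weights, visual_tokens))
            for d in range(dim)
        ])
    return compressed


random.seed(0)
dim, num_visual, num_queries = 8, 64, 4
# Random stand-ins for encoder outputs and for learned query embeddings.
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_visual)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

compressed = cross_attention_project(queries, visual_tokens)
print(len(visual_tokens), "->", len(compressed))
```

Because each output is a weighted average over the whole token set, reordering the visual tokens leaves the result unchanged, which is one plausible reading of why no 2D scan order has to be designed.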


@AdinaYakup: JD just released JoyAI-Echo. An interesting long video generation model. 5 minute multi shot video generation. Cross modal…

X AI KOLs Following ↗ · 6d ago Cached

JD released JoyAI-Echo, a long video generation model that produces 5-minute multi-shot videos. It uses cross-modal memory to keep characters and voices consistent, generates audio and video natively, and reports a 7.5x speedup from DMD (Distribution Matching Distillation).


Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

arXiv cs.LG ↗ · 6d ago Cached

This paper introduces StenCE, a pretraining framework that uses cross-modal contrastive learning between ECG and X-ray angiography representations to detect severe coronary stenosis from ECGs, achieving high performance and enabling early diagnosis even in asymptomatic patients.
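The summary does not say which contrastive objective StenCE uses. As a sketch only, the block below assumes a CLIP-style symmetric InfoNCE loss over paired embeddings; the function names, the temperature value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(row, idx):
    """Log-probability of entry `idx` under a softmax over `row`."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return row[idx] - log_sum_exp


def symmetric_info_nce(ecg_emb, angio_emb, temperature=0.1):
    """Assumed CLIP-style loss: the i-th ECG and i-th angiogram are a positive pair."""
    ecg = [normalize(v) for v in ecg_emb]
    angio = [normalize(v) for v in angio_emb]
    n = len(ecg)
    logits = [[sum(a * b for a, b in zip(ecg[i], angio[j])) / temperature
               for j in range(n)] for i in range(n)]
    # ECG -> angiography direction: each row should peak on the diagonal.
    loss_rows = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Angiography -> ECG direction: same check on the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_cols = -sum(log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (loss_rows + loss_cols)


# Toy embeddings: aligned pairs should score a lower loss than swapped pairs.
ecg_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(ecg_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_info_nce(ecg_batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

Minimizing a loss of this kind pulls each ECG embedding toward the embedding of its paired angiogram and away from the other angiograms in the batch.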


TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

arXiv cs.AI ↗ · 2026-06-02 Cached

TIGER is an inference-time framework that mitigates hallucinations in multimodal generation by extracting observation and claim graphs and assigning risk scores to repair unsupported facts. It reduces unsupported content across image-to-text, image+text-to-text, audio-to-text, and video-to-text tasks.


Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

arXiv cs.CL ↗ · 2026-06-02 Cached

This paper introduces UniKE, the first benchmark for cross-modal knowledge editing in unified multimodal models (UMMs), revealing a significant modality gap where text edits achieve 92% efficacy but only 18.5% transfer to image generation. It proposes Reasoning-augmented Parameter Editing to improve cross-modal transfer, with gains up to 18.6 percentage points.


LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Hugging Face Daily Papers ↗ · 2026-05-28 Cached

LoMo proposes a data curation method that reformulates single-modality prompts into interleaved multimodal sequences to improve cross-modal representation alignment in vision-language models, achieving consistent gains on multiple benchmarks.


SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Hugging Face Daily Papers ↗ · 2026-05-21 Cached

SceneAligner is a deep learning approach to floorplan localization that combines 3D scene reconstruction with cross-modal correspondence learning, allowing it to work in real-world environments with limited data.


Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

arXiv cs.CL ↗ · 2026-05-20 Cached

This paper systematically investigates cross-modal skill injection, in which a domain-expert LLM is merged into a VLM to induce emergent multimodal capabilities. It evaluates different scenarios (instruction following, cross-lingual, mathematical reasoning), merging methods (Task Arithmetic (TA), DARE, and others), and hyperparameters, finding that TA and DARE perform well in every scenario except mathematical reasoning.
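To make the merging idea concrete, here is a minimal sketch of Task Arithmetic (TA) applied to skill injection. It assumes the expert LLM and the VLM's language backbone were both fine-tuned from the same base LLM and share parameter names. The parameter names, the toy weights, and the `scale` value are hypothetical and are not taken from the paper.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}


def inject_skill(vlm, vector, scale=1.0):
    """Add a scaled task vector to the matching VLM parameters.

    Parameters without a counterpart in the task vector (for example the
    vision encoder) are copied unchanged.
    """
    merged = {}
    for name, weights in vlm.items():
        if name in vector:
            merged[name] = [w + scale * d for w, d in zip(weights, vector[name])]
        else:
            merged[name] = list(weights)
    return merged


# Toy weights stored as flat lists; real checkpoints hold tensors.
base_llm = {"layer.w": [1.0, 2.0]}
expert_llm = {"layer.w": [1.5, 2.0]}
vlm = {"layer.w": [1.0, 2.5], "vision.w": [0.3]}

skill = task_vector(expert_llm, base_llm)
merged = inject_skill(vlm, skill, scale=0.5)
print(merged)
```

The `scale` factor is the kind of hyperparameter such a study would sweep. DARE, as commonly described, additionally drops a random subset of the task-vector entries and rescales the rest before adding them.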


LatentUMM: Dual Latent Alignment for Unified Multimodal Models

Hugging Face Daily Papers ↗ · 2026-05-18 Cached

LatentUMM introduces dual latent alignment to improve cross-modal consistency in unified multimodal models by aligning transformations and stabilizing latent dynamics.


AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Hugging Face Daily Papers ↗ · 2026-05-14 Cached

AuralSAM2 integrates audio into SAM2 via an AuralFuser module that generates sparse and dense prompts from audio-visual features, enhancing cross-modal segmentation while maintaining interactive efficiency.


MNAFT: modality neuron-aware fine-tuning of multimodal large language models for image translation

Hugging Face Daily Papers ↗ · 2026-04-18 Cached

MNAFT (Modality Neuron-Aware Fine-Tuning) is a novel approach that selectively updates language-specific and language-agnostic neurons in multimodal large language models to improve image translation while preserving pre-trained knowledge. The method outperforms state-of-the-art image translation techniques including cascaded models and standard fine-tuning approaches.
