Tag
This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.
JD released JoyAI-Echo, a long-video generation model that produces 5-minute multi-shot videos, uses cross-modal memory to keep characters and voices consistent, generates audio and video natively, and achieves a 7.5x speedup via DMD distillation.
This paper introduces StenCE, a pretraining framework that uses cross-modal contrastive learning between ECG and X-ray angiography representations to detect severe coronary stenosis from ECGs, achieving high performance and enabling early diagnosis even in asymptomatic patients.
TIGER is an inference-time framework that mitigates hallucinations in multimodal generation by extracting observation and claim graphs and assigning risk scores to repair unsupported facts. It reduces unsupported content across image-to-text, image+text-to-text, audio-to-text, and video-to-text tasks.
This paper introduces UniKE, the first benchmark for cross-modal knowledge editing in unified multimodal models (UMMs), revealing a significant modality gap where text edits achieve 92% efficacy but only 18.5% transfer to image generation. It proposes Reasoning-augmented Parameter Editing to improve cross-modal transfer, with gains of up to 18.6 percentage points.
LoMo proposes a data curation method that reformulates single-modality prompts into interleaved multimodal sequences to improve cross-modal representation alignment in vision-language models, achieving consistent gains on multiple benchmarks.
This paper presents SceneAligner, a deep learning approach to floorplan localization that uses 3D scene reconstruction and cross-modal correspondence learning to work in real-world environments with limited data.
This paper systematically investigates cross-modal skill injection, where a domain-expert LLM is merged into a VLM to induce emergent multimodal capabilities. It evaluates different scenarios (instruction-following, cross-lingual, mathematical reasoning), merging methods (TA, DARE, etc.), and hyperparameters, finding that TA and DARE perform well except in mathematical reasoning.
LatentUMM introduces dual latent alignment to improve cross-modal consistency in unified multimodal models by aligning transformations and stabilizing latent dynamics.
AuralSAM2 integrates audio into SAM2 via an AuralFuser module that generates sparse and dense prompts from audio-visual features, enhancing cross-modal segmentation while maintaining interactive efficiency.
MNAFT (Modality Neuron-Aware Fine-Tuning) is a novel approach that selectively updates language-specific and language-agnostic neurons in multimodal large language models to improve image translation while preserving pre-trained knowledge. The method outperforms state-of-the-art image translation techniques including cascaded models and standard fine-tuning approaches.
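The merging step in the cross-modal skill injection entry above can be illustrated with a minimal Python sketch. It assumes that "TA" refers to task arithmetic, in which the difference between an expert's weights and the shared base weights (a task vector) is scaled and added to the target model. The weight names, the values, the `scale` parameter, and both helper functions are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Return finetuned - base for every parameter (the task vector)."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_task_arithmetic(target: Weights, vector: Weights, scale: float) -> Weights:
    """Add a scaled task vector to the target model's weights."""
    return {
        name: [t + scale * v for t, v in zip(target[name], vector[name])]
        for name in target
    }


# Hypothetical weights: a shared base LLM, a domain-expert LLM fine-tuned
# from it, and the language backbone of a VLM fine-tuned from the same base.
base = {"layer.w": [0.0, 1.0, 2.0]}
expert = {"layer.w": [0.5, 1.0, 1.0]}
vlm = {"layer.w": [0.25, 1.5, 2.0]}

tau = task_vector(expert, base)                       # what the expert learned
merged = merge_task_arithmetic(vlm, tau, scale=0.5)   # inject it into the VLM
print(merged)
```

The scaling coefficient is the kind of hyperparameter such a study would sweep. A DARE-style variant would additionally drop a random subset of the task vector's entries and rescale the rest before adding them, which this sketch does not do.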