local-structure

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Hugging Face Daily Papers · 2026-06-10

This paper introduces a multimodal image fusion method that uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing (STE). Experiments on four benchmarks show state-of-the-art performance in both global coherence and local fidelity.
