Tag
This paper introduces a multimodal image fusion method that uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing (STE). Experiments on four benchmarks show state-of-the-art performance in both global coherence and local fidelity.