From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion
Summary
This paper introduces a multimodal image fusion method that uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing (STE). Experiments on four benchmarks show state-of-the-art performance in both global coherence and local fidelity.
Cached at: 06/12/26, 10:52 AM
Source: https://huggingface.co/papers/2606.12303
Abstract
A multimodal image fusion approach uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing.
Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: https://zju-xyc.github.io/1D-Fusion-Project-Page/
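The core idea of Selective Token Editing, sparsely replacing a small set of tokens in a 1D sequence, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how "critical" tokens are chosen, so the function name `selective_token_edit`, the per-token `scores` input, and the top-k selection rule below are all assumptions made for illustration. Tokens are modeled as rows of a NumPy array of shape (N, D).

```python
import numpy as np


def selective_token_edit(base_tokens, donor_tokens, scores, k):
    """Replace the k highest-scoring rows of base_tokens with donor rows.

    Hypothetical illustration only: `scores` stands in for whatever
    criticality measure a real method would use (not specified in the abstract).
    Returns the edited copy and the sorted indices that were replaced.
    """
    if base_tokens.shape != donor_tokens.shape:
        raise ValueError("token sequences must have the same shape")
    if scores.shape != (base_tokens.shape[0],):
        raise ValueError("scores must hold one value per token")

    n = base_tokens.shape[0]
    k = max(0, min(k, n))          # clamp k to a valid range
    edited = base_tokens.copy()    # leave the input sequence untouched
    if k == 0:
        return edited, np.array([], dtype=int)

    # Indices of the k largest scores (assumed selection rule).
    idx = np.argsort(scores)[-k:]
    edited[idx] = donor_tokens[idx]
    return edited, np.sort(idx)


# Toy usage: 8 tokens of dimension 4, edit the 2 highest-scoring ones.
base = np.zeros((8, 4))
donor = np.ones((8, 4))
scores = np.arange(8, dtype=float)
edited, replaced = selective_token_edit(base, donor, scores, k=2)
```

In this toy setup only the two highest-scoring token positions are overwritten and the remaining tokens pass through unchanged, which is the sense in which the edit is sparse.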
Similar Articles
Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.
(1D) Ordered Tokens Enable Efficient Test-Time Search
This paper investigates how 1D coarse-to-fine token structures in autoregressive models improve test-time search efficiency compared to classical 2D grid tokenization. The authors show that such ordered tokens enable better test-time scaling and even training-free text-to-image generation when guided by image-text verifiers.
Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
This paper proposes DPVR-LF, a modality-asymmetric routing framework for MLLMs that routes vision tokens at their saturation point into a lightweight side branch and performs late fusion, reducing visual computation while maintaining competitive performance.
The Galaxy's Guide to the Tokenizer: A Benchmark for Scientific Foundation Models
This paper compares four tokenization methods (Affine, AIM, JetFormer, VQ-VAE) for astronomical images within a unified transformer framework, using 640,000 galaxy images to evaluate reconstruction quality, physical property prediction, and morphological preservation. It finds that no single method excels across all tasks, highlighting trade-offs in representation learning.