HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing
Summary
HiLo-Token introduces an input-adaptive token compression framework for Diffusion Transformers that allocates more tokens to high-frequency regions, achieving up to 3.13x speedup in image editing tasks without quality loss.
Cached at: 06/18/26, 07:58 PM
Source: https://huggingface.co/papers/2606.13898
Abstract
A novel token compression framework called HiLo-Token is introduced to accelerate Diffusion Transformers in image editing tasks by adaptively allocating tokens based on spatial frequency and context importance, achieving significant speedups without quality loss.
Creative image editing tools, such as Photoshop’s Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
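The three-way token split described in the abstract can be illustrated with a short NumPy sketch. This is not the authors' implementation: the abstract does not specify how spatial frequency is measured, how the budget is set, or how the mask is dilated, so everything below is an assumption made for illustration. In particular, the function names (dilate, patch_frequency_score, average_pool, select_tokens), the use of mean squared gradient as the frequency score, the square dilation window, the keep_ratio parameter, and the reading of "16x downsampled" as a per-side factor are all hypothetical choices.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a square (2*radius+1) window (assumed dilation shape)."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def patch_frequency_score(image, patch):
    """Mean squared gradient per patch, used here as a stand-in frequency measure."""
    gy, gx = np.gradient(image.astype(np.float64))
    energy = gx ** 2 + gy ** 2
    gh, gw = image.shape[0] // patch, image.shape[1] // patch
    energy = energy[:gh * patch, :gw * patch]
    return energy.reshape(gh, patch, gw, patch).mean(axis=(1, 3))

def average_pool(image, factor):
    """Downsample by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape[0] // factor, image.shape[1] // factor
    cropped = image[:h * factor, :w * factor]
    return cropped.reshape(h, factor, w, factor).mean(axis=(1, 3))

def select_tokens(image, token_mask, patch=16, dilation=1, keep_ratio=0.25, downsample=16):
    """Return (edit, high, low): kept edit-region tokens, kept high-frequency
    tokens outside the edit region, and a low-resolution image for coarse tokens."""
    edit = dilate(token_mask, dilation)            # keep every token in the dilated mask
    score = patch_frequency_score(image, patch)
    score[edit] = -np.inf                          # never re-select edit-region tokens
    n_outside = int((~edit).sum())
    k = min(int(round(keep_ratio * n_outside)), n_outside)
    top = np.argsort(-score, axis=None)[:k]        # indices of the k highest scores
    high = np.zeros_like(edit)
    high.flat[top] = True
    low = average_pool(image, downsample)          # source of the low-frequency tokens
    return edit, high, low

# Toy input: flat left half, noisy (high-frequency) right half, small edit mask.
rng = np.random.default_rng(0)
image = np.zeros((256, 256))
image[:, 128:] = rng.random((256, 128))
token_mask = np.zeros((16, 16), dtype=bool)
token_mask[2:4, 2:4] = True

edit, high, low = select_tokens(image, token_mask)
print(edit.sum(), high.sum(), low.shape)
```

In this toy setup the full-resolution grid has 256 tokens, of which only the dilated edit region and the top-scoring quarter of the remaining tokens are kept at full resolution; the rest of the image is represented only through the low-resolution copy. How the low-resolution image is patchified and how the three token groups are merged for attention is not described in the abstract and is left out of the sketch.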
Similar Articles
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]
This paper introduces an adaptive video tokenisation method that exploits temporal redundancy in latent space to allocate tokens dynamically, achieving efficient compression without auxiliary networks. The proposed Latent Inpainting Transformer reconstructs dropped positions, delivering 31x speedup over ElasticTok-CV and 2x over InfoTok.
EarlyTom: Early Token Compression Completes Fast Video Understanding
EarlyTom is a training-free framework that compresses visual tokens early in the vision encoder to reduce time-to-first-token and computational costs while maintaining accuracy, achieving up to 2.65x TTFT reduction.
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
InsightTok introduces content-aware perceptual losses to improve discrete visual tokenization for better text and face reconstruction, enhancing autoregressive image generation quality.
Compute Optimal Tokenization (2 minute read)
This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.
Balancing Image Compression and Generation with Bootstrapped Tokenization
Introduces SelfBootTok, a self-bootstrapped tokenization method that separates global and local information, reducing generator computation by ~40% and achieving a new state-of-the-art gFID of 1.56 with only 64 tokens.