Linearizing Vision Transformer with Test-Time Training
Summary
This paper proposes a method to convert pretrained Softmax attention models into linear-complexity Test-Time Training (TTT) architectures, achieving text-to-image quality comparable to that of a fine-tuned Softmax model while accelerating inference. The approach is validated by linearizing Stable Diffusion 3.5, resulting in SD3.5-T^5, which runs 1.32x faster at 1K resolution and 1.47x faster at 2K resolution.
Cached at: 06/02/26, 03:24 AM
Source: https://huggingface.co/papers/2605.02772
Abstract
Researchers develop a method to convert pretrained Softmax attention models to linear-complexity Test-Time Training architectures through architectural and representational alignment, achieving fast inference with minimal fine-tuning.
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T^5 (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4× H20 GPUs, SD3.5-T^5 achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32× and 1.47× at 1K and 2K resolutions. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
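The key shift-invariance mentioned in the abstract can be checked numerically. The sketch below is not the paper's implementation. It rests on two assumptions: plain unnormalized linear attention stands in for a linear-complexity layer (it is not a TTT layer), and key instance normalization is taken to mean standardizing each key channel across tokens. All function names are hypothetical. Adding the same vector to every key leaves Softmax attention unchanged, because each query's scores all shift by the same constant. The linear variant does change, unless the keys are normalized first.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_attention(q, k, v):
    """Standard scaled dot-product attention, quadratic in the token count."""
    scale = math.sqrt(len(q[0]))
    weights = []
    for row in matmul(q, transpose(k)):
        top = max(row)  # subtract the row max for numerical stability
        exps = [math.exp((s - top) / scale) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)


def linear_attention(q, k, v):
    """Unnormalized linear attention Q (K^T V), linear in the token count.
    Assumption: a stand-in for a linear-complexity layer, not the TTT layer."""
    return matmul(q, matmul(transpose(k), v))


def key_instance_norm(k, eps=1e-6):
    """Assumed form: standardize each key channel across tokens."""
    n = len(k)
    cols = list(zip(*k))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / n + eps)
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in k]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_tokens, dim = 6, 4
q, k, v = rand_matrix(n_tokens, dim), rand_matrix(n_tokens, dim), rand_matrix(n_tokens, dim)

# Shift every key by the same (arbitrary, illustrative) vector.
shift = [3.0, -2.0, 1.5, 0.5]
k_shifted = [[x + s for x, s in zip(row, shift)] for row in k]

softmax_gap = max_abs_diff(softmax_attention(q, k, v),
                           softmax_attention(q, k_shifted, v))
linear_gap = max_abs_diff(linear_attention(q, k, v),
                          linear_attention(q, k_shifted, v))
normed_gap = max_abs_diff(linear_attention(q, key_instance_norm(k), v),
                          linear_attention(q, key_instance_norm(k_shifted), v))

print("softmax gap:", softmax_gap)
print("linear gap:", linear_gap)
print("linear gap with normalized keys:", normed_gap)
```

In this toy setting the first and third gaps sit at floating-point noise while the second does not, which matches the motivation the abstract gives for normalizing keys before weight transfer. The exact normalization axes and any learned parameters in the paper may differ from this sketch.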
Similar Articles
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
Lens is a compact 3.8B-parameter text-to-image model from Microsoft that achieves competitive performance with larger models while requiring significantly less training compute, using dense captions, multi-resolution batching, and efficient architecture.
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization
This paper introduces Vertex-Softmax, a method for tight Transformer verification by proving that exact softmax optimization over interval constraints occurs at vertices of the constraint box. It improves certified accuracy and efficiency in CROWN-style verifiers for attention models on standard datasets.
@tilderesearch: https://x.com/tilderesearch/status/2061771450168889432
Wall Attention generalizes diagonal forget gates to softmax attention, enabling state-of-the-art length extrapolation from 4k to 160k+ context zero-shot and outperforming RoPE and FoX in pretraining. It is released as a drop-in replacement with open-source Triton kernels.
Efficient Pre-Training with Token Superposition
Token-Superposition Training (TST) improves LLM pre-training efficiency by combining contiguous tokens into bags during a superposition phase with a multi-hot cross-entropy objective, achieving up to 2.5x reduction in training time without architectural changes.