ViT-Up: Faithful Feature Upsampling for Vision Transformers
Summary
ViT-Up introduces a task-agnostic feature upsampler for Vision Transformers that predicts features at arbitrary continuous image coordinates, enabling dense feature maps at any resolution and improving results on dense prediction and semantic correspondence benchmarks. It outperforms prior state-of-the-art upsamplers, with gains of up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.
Source: https://huggingface.co/papers/2606.14024
Introducing ViT-Up: a state-of-the-art, task-agnostic feature upsampler for Vision Transformers. ViT-Up predicts features at arbitrary continuous image coordinates, enabling dense feature maps at any resolution and sample-aware vision pipelines that query features only where they are needed.
Pretrained through self-supervised feature distillation on over one million ImageNet-1K images, it supports data-constrained dense prediction and fine-grained correspondence by letting downstream heads operate directly on dense DINOv3 features.
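To make "features at arbitrary continuous image coordinates" concrete, the sketch below queries a patch-level feature map at sub-pixel locations using plain bilinear interpolation. This is the simple baseline that learned upsamplers are usually compared against, not ViT-Up's method. The function name `sample_features_bilinear`, the `patch_size` parameter, and the patch-centre convention are assumptions made for illustration.

```python
import numpy as np


def sample_features_bilinear(feats, coords, patch_size):
    """Bilinearly interpolate patch features at continuous pixel coordinates.

    feats:      (H, W, C) array, one feature vector per ViT patch.
    coords:     (N, 2) array of (x, y) pixel coordinates in the input image.
    patch_size: patch side length in pixels (hypothetical parameter).
    Returns an (N, C) array of interpolated features.
    """
    feats = np.asarray(feats, dtype=np.float64)
    coords = np.asarray(coords, dtype=np.float64)
    h, w, _ = feats.shape

    # Assumption: patch (row i, col j) is centred at pixel
    # ((j + 0.5) * patch_size, (i + 0.5) * patch_size).
    # Queries outside the outermost centres are clamped to the border.
    gx = np.clip(coords[:, 0] / patch_size - 0.5, 0, w - 1)
    gy = np.clip(coords[:, 1] / patch_size - 0.5, 0, h - 1)

    # Indices of the four surrounding patches.
    x0 = np.floor(gx).astype(int)
    y0 = np.floor(gy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets, broadcast over the channel axis.
    wx = (gx - x0)[:, None]
    wy = (gy - y0)[:, None]

    top = (1 - wx) * feats[y0, x0] + wx * feats[y0, x1]
    bottom = (1 - wx) * feats[y1, x0] + wx * feats[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Because the function takes a list of query points rather than a target resolution, a downstream head can request features only at the locations it needs, such as annotated keypoints, instead of materializing a full dense map.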
ViT-Up outperforms prior state-of-the-art feature upsamplers across dense prediction and semantic correspondence benchmarks. On DINOv3-S+, ViT-Up improves over prior methods by up to:
- +2.07 mIoU on Cityscapes
- +4.17 PCK on SPair-71k
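For readers unfamiliar with the correspondence metric, semantic correspondence on SPair-71k is commonly scored with PCK (percentage of correct keypoints): a predicted keypoint counts as correct when it lies within a fraction `alpha` of the larger object bounding-box side from the ground-truth location. The sketch below follows that common convention. The function name `pck` is hypothetical, and `alpha` is left as a parameter because the threshold used for the number above is not stated here.

```python
import numpy as np


def pck(pred, gt, bbox_hw, alpha):
    """Fraction of predicted keypoints within alpha * max(bbox side) of the target.

    pred, gt: (N, 2) arrays of (x, y) keypoint locations in pixels.
    bbox_hw:  (height, width) of the object bounding box in pixels.
    alpha:    relative distance threshold (hypothetical parameter).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    dist = np.linalg.norm(pred - gt, axis=1)   # per-keypoint pixel error
    threshold = alpha * max(bbox_hw)           # absolute tolerance in pixels
    return float(np.mean(dist <= threshold))
```

A benchmark score is then typically the average of this per-pair value over all evaluated image pairs, reported as a percentage.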
The project page includes pretrained checkpoints, code for training and evaluation, quantitative results, qualitative comparisons, the arXiv preprint, and a Google Colab demo.
