ViT-Up: Faithful Feature Upsampling for Vision Transformers

Hugging Face Daily Papers 06/12/26, 12:00 AM Papers

Summary

ViT-Up is a task-agnostic feature upsampler for Vision Transformers that predicts features at arbitrary continuous image coordinates, enabling dense feature maps at any resolution and improving results on dense prediction and semantic correspondence benchmarks. It outperforms prior state-of-the-art upsamplers, with gains of up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK, demonstrating that ViT-Up scales favorably with backbone capacity.
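To make the quadratic-cost bottleneck concrete, the arithmetic can be sketched as below. The patch size of 16 and the image sizes are illustrative assumptions, not values quoted from the paper, and the count of pairwise attention scores is only a rough proxy for compute.

```python
def attention_cost(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (patch tokens, pairwise attention scores per head).

    Assumes a square image and non-overlapping square patches
    (patch_size=16 is an illustrative default, not taken from the paper).
    """
    side = image_size // patch_size   # patches along one image edge
    tokens = side * side              # size of the patch-token grid
    return tokens, tokens * tokens    # global attention compares every pair


for size in (224, 448, 896):
    tokens, pairs = attention_cost(size)
    print(f"{size}px -> {tokens} tokens, {pairs:,} attention scores per head")
```

Doubling the image side quadruples the token count and multiplies the number of attention scores by 16, which is why backbones are usually run on small grids and the dense features are recovered afterwards by an upsampler.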

Original Article

Source: https://huggingface.co/papers/2606.14024

Introducing ViT-Up: A state-of-the-art task-agnostic feature upsampler for Vision Transformers. ViT-Up predicts features at ⭐ arbitrary continuous image coordinates ⭐, enabling dense feature maps at any resolution and sample-aware vision pipelines that query features only where they are needed.
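The idea of querying features only at selected continuous coordinates can be illustrated with a naive baseline. The sketch below is not ViT-Up's learned predictor: it bilinearly interpolates a coarse patch-feature grid, which is the kind of simple upsampling that learned upsamplers aim to improve on. The function name `sample_features`, the normalized (x, y) convention, and the grid shape are assumptions made for illustration.

```python
import numpy as np


def sample_features(grid: np.ndarray, xy: np.ndarray) -> np.ndarray:
    """Bilinearly interpolate an (H, W, C) feature grid at N query points.

    xy has shape (N, 2) holding normalized (x, y) in [0, 1] over the image.
    Patch centers are assumed to sit at (i + 0.5) / H, hence the 0.5 shift.
    """
    h, w, _ = grid.shape
    # Map normalized image coordinates to continuous grid indices.
    gx = np.clip(xy[:, 0] * w - 0.5, 0, w - 1)
    gy = np.clip(xy[:, 1] * h - 0.5, 0, h - 1)
    x0 = np.floor(gx).astype(int)
    y0 = np.floor(gy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (gx - x0)[:, None]  # horizontal blend weight per query
    wy = (gy - y0)[:, None]  # vertical blend weight per query
    top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
    bottom = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
    return (1 - wy) * top + wy * bottom


# Hypothetical 14x14 token grid with 384 channels, queried at two points.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((14, 14, 384))
points = np.array([[0.5, 0.5], [0.1, 0.9]])
print(sample_features(tokens, points).shape)  # (2, 384)
```

The cost scales with the number of query points rather than with the output resolution, which is what makes a sample-aware pipeline possible.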

Pretrained through self-supervised feature distillation on over one million ImageNet-1K images, it supports data-constrained dense prediction and fine-grained correspondence by letting downstream heads operate directly on dense DINOv3 features.
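The text above does not state the distillation objective. As a purely illustrative sketch, a feature-distillation loss can be written as the mean cosine distance between predicted features and target backbone features at the same locations. The choice of cosine distance and the name `cosine_distillation_loss` are assumptions, not details confirmed by the source.

```python
import numpy as np


def cosine_distillation_loss(pred: np.ndarray, target: np.ndarray,
                             eps: float = 1e-8) -> float:
    """Mean cosine distance between (N, C) predicted and target features.

    Assumption: cosine distance stands in for the unspecified objective.
    Returns 0 for perfectly aligned features and 2 for opposite ones.
    """
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cosine = np.sum(pred_n * target_n, axis=-1)  # per-location similarity
    return float(np.mean(1.0 - cosine))
```

No labels appear in the loss: the targets are features from the backbone itself, which is what makes the pretraining self-supervised.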

ViT-Up outperforms prior state-of-the-art feature upsamplers across dense prediction and semantic correspondence benchmarks. On DINOv3-S+, ViT-Up improves over prior methods by up to:

+2.07 mIoU on Cityscapes
+4.17 PCK on SPair-71k
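For readers unfamiliar with the two metrics, a minimal sketch of their standard definitions follows. It is a simplified per-sample version: benchmark implementations typically accumulate statistics over the whole dataset and report values multiplied by 100. The PCK threshold `alpha` is left as a parameter because its value is not recoverable from the text above, and the function names are illustrative.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes for integer label maps."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps, so it is skipped
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))


def pck(pred_kps: np.ndarray, gt_kps: np.ndarray,
        bbox_hw: tuple[float, float], alpha: float) -> float:
    """Fraction of keypoints within alpha * max(bbox height, width) of truth."""
    threshold = alpha * max(bbox_hw)
    dist = np.linalg.norm(pred_kps - gt_kps, axis=-1)
    return float(np.mean(dist <= threshold))
```

Both metrics depend on spatially precise features: mIoU penalizes blurred class boundaries, and PCK penalizes matches that land even a few pixels outside the tolerance radius.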

The project page includes pretrained checkpoints, code for training and evaluation, quantitative results, qualitative comparisons, the arXiv preprint, and a Google Colab demo:

https://vitup.papers.discuna.com/

Similar Articles

UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Towards Consistent Video Geometry Estimation

UniT: Unified Geometry Learning with Group Autoregressive Transformer

Linearizing Vision Transformer with Test-Time Training
