ViT-Up: Faithful Feature Upsampling for Vision Transformers

Hugging Face Daily Papers

Summary

ViT-Up introduces a task-agnostic feature upsampler for Vision Transformers that predicts features at arbitrary continuous image coordinates, enabling dense feature maps at any resolution and improving results on dense prediction and semantic correspondence benchmarks. With the DINOv3-S+ backbone, it outperforms prior state-of-the-art upsamplers by up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.
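
For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of mean intersection-over-union (mIoU) on a toy example. The function name `mean_iou` and the label values are illustrative assumptions, not code from the ViT-Up release. Benchmark implementations usually accumulate intersections and unions over the whole dataset rather than per image.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred and target are flattened label maps of equal length.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened label maps with 3 classes (illustrative values only).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.2f}")
```

On this scale, a gain of +2.07 mIoU corresponds to an increase of about 0.02 in the averaged per-class IoU.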

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK, demonstrating that ViT-Up scales favorably with backbone capacity.
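
The quadratic-cost argument can be made concrete with a little arithmetic. The sketch below counts patch tokens and pairwise attention interactions for a few square input sizes. The patch size of 16 and the helper name `attention_cost` are assumptions for illustration, and class or register tokens are ignored.

```python
def attention_cost(height, width, patch_size=16):
    """Return (rows, cols, tokens, token_pairs) for a ViT patch grid.

    patch_size=16 is an assumed value; class and register tokens are ignored.
    """
    rows, cols = height // patch_size, width // patch_size
    tokens = rows * cols
    # Global self-attention compares every token with every token.
    return rows, cols, tokens, tokens * tokens


for side in (224, 448, 896):
    rows, cols, tokens, pairs = attention_cost(side, side)
    print(f"{side}x{side} px -> {rows}x{cols} grid, {tokens} tokens, {pairs} pairs")
```

Doubling the image side quadruples the token count and multiplies the attention pairs by 16, which is why backbones are typically run on coarse grids and a separate upsampler is attractive.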

Cached at: 06/18/26, 03:58 PM


Source: https://huggingface.co/papers/2606.14024

Introducing ViT-Up: A state-of-the-art task-agnostic feature upsampler for Vision Transformers. ViT-Up predicts features at arbitrary continuous image coordinates, enabling dense feature maps at any resolution and sample-aware vision pipelines that query features only where they are needed.
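
To illustrate what querying features at continuous image coordinates means, here is a minimal sketch that bilinearly interpolates a coarse patch-feature grid at a handful of pixel locations. This is a plain bilinear baseline, not ViT-Up's learned upsampler. The function name `sample_bilinear`, the patch size of 16, and the toy feature values are all assumptions for illustration.

```python
import math


def sample_bilinear(grid, x, y, patch_size=16):
    """Interpolate a patch-feature grid at continuous pixel coordinates (x, y).

    grid[r][c] is the feature vector (list of floats) of the patch whose
    center lies at pixel ((c + 0.5) * patch_size, (r + 0.5) * patch_size).
    """
    rows, cols = len(grid), len(grid[0])
    # Map pixel coordinates to fractional grid indices, clamped to the
    # span between the first and last patch centers.
    gx = min(max(x / patch_size - 0.5, 0.0), cols - 1.0)
    gy = min(max(y / patch_size - 0.5, 0.0), rows - 1.0)
    c0, r0 = int(math.floor(gx)), int(math.floor(gy))
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    wx, wy = gx - c0, gy - r0
    dim = len(grid[0][0])
    return [
        (1 - wy) * ((1 - wx) * grid[r0][c0][d] + wx * grid[r0][c1][d])
        + wy * ((1 - wx) * grid[r1][c0][d] + wx * grid[r1][c1][d])
        for d in range(dim)
    ]


# Toy 2x2 grid of 2-dimensional features (illustrative values only).
grid = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
# Query only the locations of interest instead of a full dense map.
queries = [(8.0, 8.0), (16.0, 16.0), (24.0, 8.0)]
features = [sample_bilinear(grid, x, y) for x, y in queries]
print(features)
```

Only the three requested locations are evaluated, which is the sample-aware usage pattern described above. A learned upsampler would replace the fixed interpolation weights with a prediction conditioned on the backbone's features.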

Pretrained through self-supervised feature distillation on over one million ImageNet-1K images, it supports data-constrained dense prediction and fine-grained correspondence by letting downstream heads operate directly on dense DINOv3 features.

ViT-Up outperforms prior state-of-the-art feature upsamplers across dense prediction and semantic correspondence benchmarks. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.

The project page includes pretrained checkpoints, code for training and evaluation, quantitative results, qualitative comparisons, the arXiv preprint, and a Google Colab demo:

https://vitup.papers.discuna.com/

