RaysUp: Ultra-light Universal Feature Upsampling via Geometry-Aware Ray Representation
Summary
RaysUp is an ultra-lightweight, task-agnostic feature upsampling framework that reconstructs high-resolution features from low-resolution VFM outputs in a geometry-aware ray domain. It achieves state-of-the-art performance with 84% fewer parameters than AnyUp (16% of its parameter count) and approximately 7x faster inference.
Cached at: 06/30/26, 03:37 PM
Source: https://huggingface.co/papers/2606.22749
Abstract
RaysUp is a lightweight, task-agnostic feature upsampling framework that reconstructs high-resolution features using geometry-aware ray domain techniques with improved efficiency and accuracy.
Pre-trained Vision Foundation Models (VFMs) have become central to modern computer vision due to their powerful semantic representations and strong generalization ability. However, their patchified or pooled outputs are inherently low-resolution, limiting their effectiveness in tasks requiring fine-grained, pixel-level reasoning. Existing feature upsampling approaches either degrade semantic fidelity or rely on VFM-specific retraining and heavy architectures, hindering efficiency and scalability. To address these challenges, we propose RaysUp, an ultra-lightweight, task-agnostic, and VFM-agnostic feature upsampling framework that reconstructs high-resolution feature maps at arbitrary resolutions. Unlike conventional 2D interpolation or attention-based schemes, RaysUp lifts feature reconstruction into a geometry-aware ray domain. Specifically, we introduce a Spatially Decoupled Guidance Encoder for direction-aware guidance encoding, an Any-Resolution Cross-Attention mechanism for resolution-flexible reconstruction, and a novel Ray Positional Encoding (RayPE) that injects implicit 3D geometric priors via 6D Plücker ray coordinates. Finally, a Geometry-Aware Neighborhood Attention module further ensures content-adaptive bilateral aggregation while preserving geometric consistency. Extensive experiments across diverse dense prediction tasks demonstrate that RaysUp achieves state-of-the-art performance while using only 16% of the parameters of AnyUp and delivering approximately 7x faster inference. These results highlight a substantially improved accuracy-efficiency trade-off and establish RaysUp as a practical and scalable solution for universal feature upsampling. Code is available at https://github.com/MAP-RaysUp/RaysUp.
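For readers unfamiliar with the 6D Plücker ray coordinates mentioned in the abstract, the following is a minimal sketch of how such coordinates can be computed per pixel. It is not the RaysUp implementation: the abstract does not say how rays are derived from a 2D image, so the pinhole camera, the default focal length, the principal point at the image center, and the function name `plucker_rays` are all illustrative assumptions. The only part taken as given is the standard definition of a Plücker ray as the pair (d, o × d), where d is the unit ray direction and o is a point on the ray.

```python
import numpy as np

def plucker_rays(height, width, focal=None, origin=(0.0, 0.0, 0.0)):
    """Return an (H, W, 6) array holding (d, o x d) for every pixel.

    Hypothetical helper: assumes a pinhole camera looking down +z with the
    principal point at the image center. None of this comes from the paper.
    """
    if focal is None:
        focal = float(max(height, width))  # assumed focal length in pixels
    cx, cy = width / 2.0, height / 2.0

    # Pixel-center coordinates; both u and v have shape (H, W).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)

    # Back-project each pixel to a direction and normalize it to unit length.
    dirs = np.stack([(u - cx) / focal, (v - cy) / focal, np.ones_like(u)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Moment vector m = o x d; it is zero when the camera sits at the origin.
    o = np.broadcast_to(np.asarray(origin, dtype=dirs.dtype), dirs.shape)
    moment = np.cross(o, dirs)

    return np.concatenate([dirs, moment], axis=-1)

rays = plucker_rays(4, 6, origin=(1.0, 2.0, 3.0))
print(rays.shape)
```

By construction the direction part has unit norm and is orthogonal to the moment part, which is the defining constraint of Plücker coordinates. A positional encoding such as RayPE would consume a 6D map like this one, but how RaysUp embeds it is not described on this page.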
Get this paper in your agent:
hf papers read 2606.22749
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
ViT-Up: Faithful Feature Upsampling for Vision Transformers
ViT-Up introduces a task-agnostic feature upsampler for Vision Transformers that predicts features at arbitrary continuous image coordinates, enabling dense feature maps at any resolution and improving dense prediction and semantic correspondence benchmarks. It outperforms prior state-of-the-art upsamplers, with gains of up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.
RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
RayDer is a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering for self-supervised novel view synthesis from real-world video, achieving clean power-law scaling and strong zero-shot performance.
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
UltraFlux introduces a data-model co-design approach for native 4K text-to-image generation across diverse aspect ratios, addressing positional encoding, VAE compression, and optimization challenges. It outperforms existing open-source baselines and matches proprietary models like Seedream 4.0.
SurGe: Improved Surface Geometry in Point Maps
SurGe introduces a Neighborhood Attention Decoder and a reformulated scale-invariant gradient matching loss to improve local surface geometry accuracy in feedforward 3D reconstruction, particularly for thin structures. It achieves state-of-the-art average rank on zero-shot monocular geometry benchmarks, with better local point map and normal metrics.