Tag
ViT-Up introduces a task-agnostic feature upsampler for Vision Transformers that predicts features at arbitrary continuous image coordinates, producing dense feature maps at any resolution and improving results on dense prediction and semantic correspondence benchmarks. It outperforms prior state-of-the-art upsamplers, with gains of up to +2.07 mIoU on Cityscapes and +4.17 PCK on SPair-71k.
This paper investigates whether the Engram module, an associative memory mechanism, provides content-addressable retrieval in autoregressive image generation or acts as a gated architectural side-pathway.
This paper proposes aligning latent geometry for spherical flow matching: latents are projected onto a fixed-radius sphere and interpolated with spherical linear interpolation, which consistently improves FID on class-conditional ImageNet.
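
The two operations named in the last summary, projection onto a fixed-radius sphere and spherical linear interpolation, can be sketched as follows. This is a minimal NumPy illustration and not the paper's implementation: the function names `project_to_sphere` and `slerp`, the radius value, and the latent shapes are all assumptions made for the example.

```python
import numpy as np

def project_to_sphere(x, radius):
    """Rescale each row of x so its Euclidean norm equals `radius` (hypothetical helper)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return radius * x / np.maximum(norm, 1e-12)

def slerp(x0, x1, t, eps=1e-7):
    """Spherical linear interpolation between rows of x0 and x1.

    Assumes x0 and x1 lie on the same sphere (equal norms). Nearly parallel
    pairs fall back to linear weights; antipodal pairs are not handled.
    """
    r = np.linalg.norm(x0, axis=-1, keepdims=True)
    u0 = x0 / r
    u1 = x1 / np.linalg.norm(x1, axis=-1, keepdims=True)
    # Angle between the two unit vectors.
    cos_omega = np.clip(np.sum(u0 * u1, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)
    near = sin_omega < eps
    safe = np.where(near, 1.0, sin_omega)  # avoid division by zero
    w0 = np.where(near, 1.0 - t, np.sin((1.0 - t) * omega) / safe)
    w1 = np.where(near, t, np.sin(t * omega) / safe)
    return r * (w0 * u0 + w1 * u1)

# Illustrative use: noise and "data" latents on a sphere of assumed radius 3.
rng = np.random.default_rng(0)
noise = project_to_sphere(rng.normal(size=(4, 16)), radius=3.0)
data = project_to_sphere(rng.normal(size=(4, 16)), radius=3.0)
midpoint = slerp(noise, data, t=0.5)
```

Unlike straight-line interpolation, every intermediate point returned by `slerp` keeps the same norm as its endpoints, so the interpolation path stays on the sphere.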