vision-transformer

#vision-transformer

Patch-PODiff-ViT: Structured Latent Diffusion with Patchwise POD for Super-Resolution and Uncertainty Quantification

arXiv cs.LG · 2d ago

Patch-PODiff-ViT introduces a structured latent diffusion framework using patchwise Proper Orthogonal Decomposition (POD) for super-resolution and uncertainty quantification, enabling efficient diffusion with a fixed linear orthonormal basis and analytic propagation of predictive variance.

MuSViT: A Foundation Vision Model for Sheet Music Representation

Hugging Face Daily Papers · 3d ago

MuSViT is the first foundation vision model for sheet music, pre-trained on millions of pages via Masked Autoencoders, achieving superior performance in score recognition and symbol detection tasks.

I made a superhuman Generals.io agent with self-play RL [P]

Reddit r/MachineLearning · 2026-06-24

The author trained a superhuman Generals.io agent using self-play reinforcement learning, with a JAX-based pipeline and a Vision Transformer. The agent reached #1 on the human 1v1 leaderboard; all code and a fast JAX simulator are open-sourced.

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Hugging Face Daily Papers · 2026-06-11

HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, achieving strong performance across understanding and generation tasks.

Linearizing Vision Transformer with Test-Time Training

Hugging Face Daily Papers · 2026-05-28

This paper proposes a method to convert pretrained Softmax attention models into linear-complexity Test-Time Training (TTT) architectures, achieving text-to-image quality comparable to fine-tuned Softmax models while significantly accelerating inference. The approach is validated by linearizing Stable Diffusion 3.5, yielding SD3.5-T^5, which runs with a 1.32x speedup at 1K resolution.

Rust implementations of vision transformer models

Reddit r/ArtificialInteligence · 2026-05-24

A Rust crate for building and experimenting with Vision Transformer (ViT) models, providing typed configs, reusable structs, and runnable examples for research and production.

A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers

arXiv cs.LG · 2026-05-08

This paper proposes an architecture that augments Flux Neural Operators with recurrent Vision Transformers, yielding a foundation model for solving conservation laws. It demonstrates robust generalization and long-time prediction across diverse conservative systems without explicit access to the governing equations.
