Tag
Patch-PODiff-ViT introduces a structured latent diffusion framework using patchwise Proper Orthogonal Decomposition (POD) for super-resolution and uncertainty quantification, enabling efficient diffusion with a fixed linear orthonormal basis and analytic propagation of predictive variance.
MuSViT is the first foundation vision model for sheet music, pre-trained on millions of pages via Masked Autoencoders, achieving superior performance in score recognition and symbol detection tasks.
A superhuman Generals.io agent was trained using self-play reinforcement learning with a JAX-based pipeline and a Vision Transformer. It reached #1 on the human 1v1 leaderboard; all code and a fast JAX simulator are open-sourced.
HYDRA-X presents a unified multimodal model that integrates image and video tokenization within a single Vision Transformer, achieving strong performance across understanding and generation tasks.
This paper proposes a method to convert pretrained Softmax attention models into linear-complexity Test-Time Training (TTT) architectures, achieving text-to-image quality comparable to that of fine-tuned Softmax models while significantly accelerating inference. The approach is validated by linearizing Stable Diffusion 3.5, yielding SD3.5-T^5, which runs 1.32x faster at 1K resolution.
A Rust crate for building and experimenting with Vision Transformer (ViT) models, providing typed configs, reusable structs, and runnable examples for research and production.
This paper proposes a new architecture that augments Flux Neural Operators with recurrent Vision Transformers, yielding a foundation model for solving conservation laws. It demonstrates robust generalization and long-time prediction capabilities across diverse conservative systems without explicit access to governing equations.