A GitHub repository providing minimal, standalone PyTorch reimplementations of JEPA family models (I-JEPA, V-JEPA, V-JEPA 2, C-JEPA) for educational purposes, including tutorials and visualization tools.
GridProbe is a training-free inference paradigm for long-video VLMs that uses posterior probing to adaptively select relevant frames, achieving sub-quadratic attention cost with minimal accuracy loss.
NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.
SignX proposes a novel framework for continuous sign language recognition that unifies heterogeneous pose formats into a compact latent space and achieves state-of-the-art accuracy with 50× computational acceleration over pixel-space baselines.
EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for a 1.47× throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.
Pegasus 1.5 is an AI model by TwelveLabs designed to transform video content into time-based metadata, enabling automated video understanding and indexing.
This paper introduces OmniScript, an 8B-parameter omni-modal (audio-visual) language model for a novel video-to-script (V2S) task: generating hierarchical, scene-by-scene scripts from long-form cinematic videos. Trained with a progressive pipeline that includes chain-of-thought SFT and reinforcement learning with temporally segmented rewards, OmniScript outperforms larger open-source models and rivals proprietary models such as Gemini 3-Pro.