video-understanding

#video-understanding

GitHub - keon/jepa: implementing minimal versions of joint-embedding predictive architecture (JEPA)

Reddit r/ArtificialInteligence · 22h ago

A GitHub repository providing minimal, standalone PyTorch reimplementations of JEPA family models (I-JEPA, V-JEPA, V-JEPA 2, C-JEPA) for educational purposes, including tutorials and visualization tools.


#video-understanding

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Hugging Face Daily Papers · 2d ago

GridProbe is a training-free inference paradigm for Long-Video VLMs that adaptively selects relevant frames using posterior probing, achieving sub-quadratic attention costs with minimal accuracy loss.


#video-understanding

@nomadicai: The future of computer vision is agentic. 1/ We built Nomadic around a gap we kept seeing in video understanding: VLMs …

X AI KOLs Following · 2026-04-21

NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.


#video-understanding

SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space

arXiv cs.CL · 2026-04-20

SignX proposes a framework for continuous sign language recognition that unifies heterogeneous pose formats into a compact latent space, reporting state-of-the-art accuracy with a 50× computational speedup over pixel-space baselines.


#video-understanding

EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers · 2026-04-18

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks. It features offline preprocessing with tensor caching for a 1.47× throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.


#video-understanding

Pegasus 1.5 by TwelveLabs

Product Hunt · 2026-04-14

Pegasus 1.5 is an AI model by TwelveLabs designed to transform video content into time-based metadata, enabling automated video understanding and indexing.


#video-understanding

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Hugging Face Daily Papers · 2026-04-13

This paper introduces OmniScript, an 8B-parameter omni-modal (audio-visual) language model for a novel video-to-script (V2S) task: generating hierarchical, scene-by-scene scripts from long-form cinematic videos. Trained with a progressive pipeline that includes chain-of-thought SFT and reinforcement learning with temporally segmented rewards, OmniScript outperforms larger open-source models and rivals proprietary models such as Gemini 3-Pro.

