video-understanding

#video-understanding

MetaphorVU: Towards Metaphorical Video Understanding

Hugging Face Daily Papers ↗ · 2026-05-25 Cached

This paper introduces MetaphorVU-Bench, the first systematic benchmark for metaphorical video understanding, and proposes MetaphorBoost, an inference-time enhancement framework that improves cross-domain mapping in multimodal large language models.


Show HN: Lance – image/video generation and understanding in one model

Hacker News Top ↗ · 2026-05-20 Cached

ByteDance releases Lance, a 3B-parameter unified multimodal model supporting image and video generation, understanding, and editing, trained from scratch with a multi-task recipe.


Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Hugging Face Daily Papers ↗ · 2026-05-20 Cached

Flat-Pack Bench is a benchmark for evaluating fine-grained spatio-temporal reasoning in large vision-language models (LVLMs) using furniture assembly tasks. Experiments show that current LVLMs struggle with tracking and spatial interactions.


@HappyyPablo: open sourcing Marlin-2B a tiny VLM to extract structured information from videos Marlin is finetuned for two questions …

X AI KOLs Timeline ↗ · 2026-05-19 Cached

Marlin-2B is an open-source, tiny VLM for extracting structured information from videos, fine-tuned to answer "what is happening and when". The author describes it as the best open model in its weight class and competitive with Gemini-2.5-flash.


ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

Hugging Face Daily Papers ↗ · 2026-05-19 Cached

ParaVT introduces the first multi-agent end-to-end RL framework for parallel video tool calling, addressing the Tool Prior Paradox with PARA-GRPO, and fully open-sources the paper, code, weights, and data.


@elonmusk: Grok groks videos

X AI KOLs Following ↗ · 2026-05-18 Cached

Grok now supports full video analysis, including summarization, translation, scene explanation, and context extraction, becoming natively multimodal with strong vision capabilities.


See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

Hugging Face Daily Papers ↗ · 2026-05-18 Cached

SWIM is a novel training strategy that aligns vision and language representations for fine-grained object understanding using only textual prompts, leveraging mask supervision during training to improve cross-modal attention. It introduces the NL-Refer dataset and achieves superior performance over visual-prompt-based methods.


OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Hugging Face Daily Papers ↗ · 2026-05-18 Cached

OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.


LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Hugging Face Daily Papers ↗ · 2026-05-17 Cached

LiteFrame proposes a lightweight video encoder with Compressed Token Distillation training that reduces latency and enables processing 8x more frames for long-form video understanding in Video LLMs, improving accuracy while reducing compute.


bytedance-research/Lance

Hugging Face Models Trending ↗ · 2026-05-15 Cached

ByteDance Research introduces Lance, a 3B-parameter unified multimodal model trained from scratch on 128 A100 GPUs, capable of image and video understanding, generation, and editing within a single framework.


VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Hugging Face Daily Papers ↗ · 2026-05-15 Cached

VideoSeeker introduces a paradigm for instance-level video understanding that integrates agentic reasoning with visual prompts, achieving superior performance through automated data synthesis and reinforcement learning, outperforming GPT-4o and Gemini-2.5-Pro.


@VincentLogic: NVIDIA really went all out this time, directly releasing an open-source video understanding monster Nemotron 3 Nano Omni that processes video at an insane speed: 1 hour to handle 10 hours of video content, 10 times faster than playback speed. The core relies on 3D convolution technology, no longer scanning frame by frame, but instead…

X AI KOLs Timeline ↗ · 2026-05-14

NVIDIA has open-sourced the video understanding model Nemotron 3 Nano Omni, which uses 3D convolutions instead of frame-by-frame scanning and processes video about 10 times faster than playback speed (roughly 10 hours of content in 1 hour). It excels at audio-video analysis, surveillance retrieval, and asset tagging, but is not suited to code or text inference tasks.


ViMU: Benchmarking Video Metaphorical Understanding

Hugging Face Daily Papers ↗ · 2026-05-14 Cached

ViMU is the first benchmark designed to evaluate video understanding models' ability to interpret metaphorical, ironic, and social meanings beyond literal visual comprehension, using hint-free open-ended and multiple-choice questions.


When Vision Speaks for Sound

Hugging Face Daily Papers ↗ · 2026-05-13 Cached

This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.


GitHub - keon/jepa: implementing minimal versions of joint-embedding predictive architecture (JEPA)

Reddit r/ArtificialInteligence ↗ · 2026-05-12 Cached

A GitHub repository providing minimal, standalone PyTorch reimplementations of JEPA family models (I-JEPA, V-JEPA, V-JEPA 2, C-JEPA) for educational purposes, including tutorials and visualization tools.


GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Hugging Face Daily Papers ↗ · 2026-05-11 Cached

GridProbe is a training-free inference paradigm for Long-Video VLMs that adaptively selects relevant frames using posterior probing, achieving sub-quadratic attention costs with minimal accuracy loss.


@nomadicai: The future of computer vision is agentic. 1/ We built Nomadic around a gap we kept seeing in video understanding: VLMs …

X AI KOLs Following ↗ · 2026-04-21 Cached

NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.


SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space

arXiv cs.CL ↗ · 2026-04-20 Cached

SignX proposes a novel framework for continuous sign language recognition that unifies heterogeneous pose formats into a compact latent space and achieves state-of-the-art accuracy with 50× computational acceleration over pixel-space baselines.


EasyVideoR1: Easier RL for Video Understanding

Hugging Face Daily Papers ↗ · 2026-04-18 Cached

EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks. It features offline preprocessing with tensor caching for a 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.


Pegasus 1.5 by TwelveLabs

Product Hunt ↗ · 2026-04-14

Pegasus 1.5 is an AI model by TwelveLabs designed to transform video content into time-based metadata, enabling automated video understanding and indexing.
