This paper introduces MetaphorVU-Bench, the first systematic benchmark for metaphorical video understanding, and proposes MetaphorBoost, an inference-time enhancement framework that improves cross-domain mapping in multimodal large language models.
ByteDance releases Lance, a 3B-parameter unified multimodal model that supports image and video generation, understanding, and editing, trained from scratch with a multi-task recipe.
Introduces Flat-Pack Bench, a benchmark for evaluating fine-grained spatio-temporal reasoning in large vision-language models using furniture assembly tasks. Experiments show current LVLMs struggle with tracking and spatial interactions.
Announces the open-source release of Marlin-2B, a tiny VLM for extracting structured information from videos, fine-tuned to answer 'what is happening and when'. The authors describe it as the best open model in its weight class and competitive with Gemini-2.5-flash.
ParaVT introduces the first multi-agent end-to-end RL framework for parallel video tool calling, addressing the Tool Prior Paradox with PARA-GRPO, and fully open-sources the paper, code, weights, and data.
Grok now supports full video analysis, including summarization, translation, scene explanation, and context extraction, becoming natively multimodal with strong vision capabilities.
SWIM is a training strategy that aligns vision and language representations for fine-grained object understanding using only textual prompts, leveraging mask supervision during training to improve cross-modal attention. It introduces the NL-Refer dataset and outperforms visual-prompt-based methods.
OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.
LiteFrame proposes a lightweight video encoder trained with Compressed Token Distillation. It reduces latency and lets Video LLMs process 8x more frames for long-form video understanding, improving accuracy while reducing compute.
ByteDance Research introduces Lance, a 3B-parameter unified multimodal model trained from scratch on 128 A100 GPUs, capable of image and video understanding, generation, and editing within a single framework.
VideoSeeker introduces a paradigm for instance-level video understanding that integrates agentic reasoning with visual prompts. Trained with automated data synthesis and reinforcement learning, it outperforms GPT-4o and Gemini-2.5-Pro.
NVIDIA has open-sourced Nemotron 3 Nano Omni, a video understanding model that uses 3D convolutions and processes video 10 times faster than playback speed. It excels at audio-video analysis, surveillance retrieval, and asset tagging, but is not suitable for code or text inference tasks.
ViMU is the first benchmark designed to evaluate video understanding models' ability to interpret metaphorical, ironic, and social meanings beyond literal visual comprehension, using hint-free open-ended and multiple-choice questions.
This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.
A GitHub repository providing minimal, standalone PyTorch reimplementations of JEPA family models (I-JEPA, V-JEPA, V-JEPA 2, C-JEPA) for educational purposes, including tutorials and visualization tools.
GridProbe is a training-free inference paradigm for long-video VLMs that adaptively selects relevant frames using posterior probing, achieving sub-quadratic attention costs with minimal accuracy loss.
NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.
SignX proposes a framework for continuous sign language recognition that unifies heterogeneous pose formats into a compact latent space and achieves state-of-the-art accuracy with a 50× computational speedup over pixel-space baselines.
EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks. It features offline preprocessing with tensor caching for a 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.
Pegasus 1.5 is an AI model by TwelveLabs designed to transform video content into time-based metadata, enabling automated video understanding and indexing.