Tag
OVO-S-Bench introduces a comprehensive human-annotated benchmark of 1,680 questions across 348 videos to evaluate streaming spatial intelligence in multimodal LLMs, revealing that even the best model (Gemini-3.1-Pro) trails human experts by 27 points. The benchmark exposes two key limitations: allocentric mapping is a major bottleneck, and chain-of-thought reasoning amplifies spatial errors.
This paper introduces TaskMem, a reinforcement-learning-based framework for dynamic memorization in multimodal agents, which improves accuracy by 6.3%, 7.0%, and 5.3% on streaming video benchmarks.
This paper introduces AdaState, a method that replaces the static first-frame anchor in autoregressive video diffusion models with an adaptive state that evolves with the generated content, enabling richer motion and natural scene progression in streaming video generation.
OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.
Stream-R1 introduces a reliability-perplexity-aware reward distillation framework for streaming video generation that adaptively weights supervision to improve visual and motion quality without additional computational overhead.