Towards One-to-Many Temporal Grounding
Summary
This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query, along with a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that achieve state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.
Cached at: 06/05/26, 10:07 AM
Source: https://huggingface.co/papers/2606.06294 Published on Jun 4
Submitted by https://huggingface.co/insomnia7
XuQi on Jun 5
Abstract
One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.
Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
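The abstract names Count Accuracy and Effective Temporal F1 but does not define them here. As a rough illustration of the kind of scoring a one-to-many setting calls for, the sketch below checks whether the predicted number of segments matches the ground truth and computes a generic segment-level F1 by greedy IoU matching. This is not the paper's C-Acc or EtF1: the function names, the 0.5 IoU threshold, and the greedy matching rule are all assumptions made for the example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """True when the predicted segment count equals the ground-truth count.
    Assumption: a count-accuracy score would average this over all queries."""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Generic segment-level F1 (NOT the paper's EtF1).
    Predictions are matched one-to-one to ground-truth segments,
    greedily in order of decreasing IoU, keeping pairs at or above the threshold."""
    if not pred or not gt:
        return 0.0
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_positives = 0
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue  # each segment may be matched only once
        used_pred.add(i)
        used_gt.add(j)
        true_positives += 1
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gt)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [(0.0, 10.0), (20.0, 30.0)]
    ground_truth = [(0.0, 10.0), (20.0, 30.0), (40.0, 50.0)]
    print(count_match(predicted, ground_truth))  # one segment is missing
    print(segment_f1(predicted, ground_truth))
```

In this example the model finds two of three ground-truth segments exactly, so precision is 1, recall is 2/3, and the count check fails. A one-to-one model that always returns a single segment would fail the count check on every multi-segment query, which is consistent with the cardinality problem the abstract describes.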
Get this paper in your agent:
hf papers read 2606.06294
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Datasets citing this paper: 2
insomnia7/omtg56k • 56.2k • 148
insomnia7/omtg_bench • 287 • 102
Similar Articles
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
OVO-S-Bench introduces a comprehensive human-annotated benchmark of 1,680 questions across 348 videos to evaluate streaming spatial intelligence in multimodal LLMs, revealing that even the best model (Gemini-3.1-Pro) trails human experts by 27 points. The benchmark exposes key limitations including allocentric mapping as a major bottleneck and chain-of-thought reasoning amplifying spatial errors.
SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]
SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.
MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs
MusTBench is a benchmark for evaluating temporal grounding in Large Audio-Language Models (LALMs) for music understanding. The authors propose MusT, a four-stage training recipe that significantly improves temporal grounding performance over existing models.
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
LocateAnything proposes Parallel Box Decoding for unified visual grounding and object detection, decoding geometric elements as atomic units to improve throughput and localization accuracy, supported by a large-scale dataset of 138M samples.
GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
GRASP is a large-scale dataset for social reasoning in multi-person videos, connecting high-level social questions with fine-grained gaze and gesture events, and introduces Social Grounding Reward to improve multimodal model understanding.