Towards One-to-Many Temporal Grounding

Hugging Face Daily Papers 06/04/26, 12:00 AM Papers

temporal-grounding video-understanding multimodal benchmark chain-of-thought dataset policy-optimization

Summary

This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query, along with a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that achieve state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
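The abstract names Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) but does not define them here. As a rough illustration of what a one-to-many evaluation has to measure, the sketch below pairs a simple count check with an IoU-thresholded segment F1. Every name in it (`temporal_iou`, `count_match`, `segment_f1`), the greedy matching, and the 0.5 threshold are assumptions made for illustration, not the paper's definitions.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed count check: did the model predict the right number of segments?"""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Illustrative F1 over segments (NOT the paper's EtF1).

    Predictions are matched to ground-truth segments one-to-one, greedily by
    descending IoU; a pair counts as a true positive if IoU >= iou_threshold.
    """
    if not pred or not gt:
        return 0.0
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# Three ground-truth events; the model finds two of them.
gt = [(0.0, 10.0), (20.0, 30.0), (40.0, 50.0)]
pred = [(1.0, 10.0), (21.0, 29.0)]
print(count_match(pred, gt), segment_f1(pred, gt))

# A one-to-one style answer that spans everything matches no event.
print(segment_f1([(0.0, 50.0)], gt))
```

Under these assumptions, a single long span covering all three events scores zero because its IoU with each individual event is only 0.2, which is consistent with the abstract's observation that models tuned for one-to-one grounding can score near zero in this setting.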


Cached at: 06/05/26, 10:07 AM


Source: https://huggingface.co/papers/2606.06294 Published on Jun 4

Submitted by https://huggingface.co/insomnia7

XuQi on Jun 5

Abstract

One-to-Many Temporal Grounding addresses the challenge of localizing multiple disjoint video segments for a single textual query through a comprehensive benchmark, novel reward functions, and improved policy optimization.



Get this paper in your agent:

hf papers read 2606.06294

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 2

insomnia7/omtg56k • Updated about 2 hours ago • 56.2k • 148

insomnia7/omtg_bench • Updated about 2 hours ago • 287 • 102



Similar Articles

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
