Confidence-Aware Tool Orchestration for Robust Video Understanding

Hugging Face Daily Papers 06/25/26, 12:00 AM Papers

video-understanding tool-orchestration confidence-aware robust agentic-framework reasoning trustworthiness

Summary

Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework, improving accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning.

Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
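The shared evidence format and the three-tier weighting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Evidence` fields, the tier cutoffs (0.7 and 0.4), the tier weights, and the voting rule are all hypothetical, since the abstract names the tiers but gives no values or aggregation formula.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any

# Hypothetical cutoffs and weights; the abstract does not specify them.
HIGH_CUTOFF, MEDIUM_CUTOFF = 0.7, 0.4
TIER_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.1}


@dataclass(frozen=True)
class Evidence:
    """One tool output in a shared format (field names are assumptions)."""
    tool: str           # which perception tool produced this evidence
    prediction: Any     # e.g. an action label, recognized text, or a box
    start_s: float      # temporal grounding: start time in seconds
    end_s: float        # temporal grounding: end time in seconds
    reliability: float  # calibrated reliability score, assumed in [0, 1]


def tier(reliability: float) -> str:
    """Map a reliability score to one of the three tiers."""
    if reliability >= HIGH_CUTOFF:
        return "high"
    if reliability >= MEDIUM_CUTOFF:
        return "medium"
    return "low"


def weighted_answer(evidence: list[Evidence]) -> Any:
    """Pick the prediction with the largest tier-weighted reliability mass.

    Predictions must be hashable (labels or strings) for this simple vote.
    """
    if not evidence:
        raise ValueError("no evidence to aggregate")
    scores: dict[Any, float] = defaultdict(float)
    for item in evidence:
        scores[item.prediction] += TIER_WEIGHTS[tier(item.reliability)] * item.reliability
    return max(scores, key=scores.get)


evidence = [
    Evidence("action_recognizer", "pick up cup", 2.0, 4.5, 0.85),
    Evidence("ocr", "open drawer", 3.0, 3.5, 0.30),
    Evidence("tracker", "open drawer", 1.0, 5.0, 0.45),
]
print(weighted_answer(evidence))  # pick up cup
```

In this toy run, one high-tier item outweighs two lower-tier items that agree with each other, which is the kind of behavior reliability-aware weighting is meant to allow. A real system would presumably pass the tiered evidence to a reasoning model rather than take a vote.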

Original Article

Cached at: 06/26/26, 06:05 AM

Source: https://huggingface.co/papers/2606.26904

Abstract

Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework that improves accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning.

Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
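As a rough illustration of selecting trustworthy frames by a reliability-relevance score, the sketch below ranks frames by the product of two per-frame scores. The product rule, the `min_reliability` floor, and all names here are assumptions made for the example; the abstract does not say how the score is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    index: int          # position of the frame in the video
    reliability: float  # assumed in [0, 1], e.g. from a blur/glare/occlusion estimator
    relevance: float    # assumed in [0, 1], e.g. similarity to the tool's sub-query


def reliability_relevance(frame: Frame) -> float:
    """Assumed combination rule: a frame must be both reliable and relevant."""
    return frame.reliability * frame.relevance


def select_frames(frames: list[Frame], k: int, min_reliability: float = 0.3) -> list[Frame]:
    """Drop clearly degraded frames, keep the top-k by score, return in time order."""
    trusted = [f for f in frames if f.reliability >= min_reliability]
    ranked = sorted(trusted, key=reliability_relevance, reverse=True)
    return sorted(ranked[:k], key=lambda f: f.index)


frames = [
    Frame(0, reliability=0.9, relevance=0.2),  # sharp but off-topic
    Frame(1, reliability=0.2, relevance=0.9),  # on-topic but heavily degraded
    Frame(2, reliability=0.8, relevance=0.7),
    Frame(3, reliability=0.6, relevance=0.8),
]
selected = select_frames(frames, k=2)
print([f.index for f in selected])  # [2, 3]
```

Frame 1 is the most relevant but falls below the reliability floor, so it never reaches a tool; that is the sense in which selection avoids blindly trusting degraded input.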

Similar Articles

@_akhaliq: paper:

When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning
