Confidence-Aware Tool Orchestration for Robust Video Understanding
Summary
Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework, improving accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning.
Cached at: 06/26/26, 06:05 AM
Source: https://huggingface.co/papers/2606.26904
Abstract
Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
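The abstract names the components of the pipeline but gives no formulas, so the following Python sketch is only an illustration of how a shared evidence format, frame selection, three-tier grouping, and a confidence-cost reward might fit together. Everything specific here is an assumption rather than the authors' method: the names `Evidence`, `select_frames`, `tier`, `group_by_tier`, and `reward`, the use of a product to combine reliability and relevance, the tier thresholds (0.7 and 0.4), and the reward weights.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Evidence:
    """Shared evidence format: prediction, temporal grounding, reliability."""
    tool: str                    # hypothetical tool name, e.g. "ocr"
    prediction: Any              # bounding box, trajectory, text, action label
    span: Tuple[float, float]    # temporal grounding (start, end) in seconds
    reliability: float           # calibrated score, assumed to lie in [0, 1]


def select_frames(reliability: List[float], relevance: List[float], k: int) -> List[int]:
    """Return the indices of the top-k frames, in temporal order.

    ASSUMPTION: the reliability-relevance score is modeled as a product.
    """
    scores = [r * q for r, q in zip(reliability, relevance)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def tier(score: float, high: float = 0.7, low: float = 0.4) -> str:
    """Map a reliability score to a tier. Thresholds are illustrative."""
    if score >= high:
        return "high"
    if score >= low:
        return "medium"
    return "low"


def group_by_tier(evidence: List[Evidence]) -> Dict[str, List[Evidence]]:
    """Bucket evidence so a reasoner can weight each tier differently."""
    tiers: Dict[str, List[Evidence]] = {"high": [], "medium": [], "low": []}
    for ev in evidence:
        tiers[tier(ev.reliability)].append(ev)
    return tiers


def reward(correct: bool, evidence: List[Evidence], n_calls: int,
           w_conf: float = 0.5, w_cost: float = 0.05) -> float:
    """Toy scalar reward: correctness + mean reliability - tool-call cost.

    ASSUMPTION: a linear combination with hand-picked weights.
    """
    mean_rel = sum(e.reliability for e in evidence) / len(evidence) if evidence else 0.0
    return float(correct) + w_conf * mean_rel - w_cost * n_calls


if __name__ == "__main__":
    frames = select_frames([0.9, 0.2, 0.8, 0.5], [0.5, 0.9, 0.9, 0.1], k=2)
    evs = [Evidence("ocr", "EXIT", (1.0, 2.0), 0.8),
           Evidence("tracker", [(0, 0), (4, 2)], (0.0, 3.0), 0.3)]
    print(frames, {name: len(items) for name, items in group_by_tier(evs).items()})
```

In this sketch a blurred but highly relevant frame can still lose to a sharp, moderately relevant one, which is the behavior the abstract attributes to trustworthy frame selection. Separately, the cost term discourages unnecessary tool calls.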
Similar Articles
@_akhaliq: paper:
This paper proposes Robust-TO, an agentic video understanding framework that integrates per-frame trustworthiness to address the Blind Trust Problem, achieving significant accuracy gains under realistic perturbations.
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
This paper introduces Adaptive Tool Trust Calibration (ATTC), a framework that improves tool-integrated reasoning models by enabling them to adaptively decide when to trust or ignore tool results based on code confidence scores. The approach addresses the "Tool Ignored" problem where models incorrectly dismiss correct tool outputs, achieving 4.1-7.5% performance improvements across multiple models and datasets.
Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning
Proposes Agent-ToM, a learning-to-monitor framework using Theory-of-Mind reasoning to detect covert malicious behavior in autonomous LLM agents by inferring beliefs and intents, outperforming baseline monitors.
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?
Robust-U1 is a framework that enables multimodal large language models (MLLMs) to self-recover corrupted visual content using supervised fine-tuning, reinforcement learning with dual rewards, and joint multimodal reasoning, achieving state-of-the-art robustness on corruption benchmarks.
CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning
This paper introduces CoRA, a GRPO-based reinforcement learning framework that aligns LLM confidence with generated rationales to improve the reliability of chain-of-thought reasoning, achieving up to 26.51% reduction in misalignment error across multiple benchmarks.