@_akhaliq: paper:
Summary
This paper proposes Robust-TO, an agentic video understanding framework that integrates per-frame trustworthiness to address the Blind Trust Problem, achieving significant accuracy gains under realistic perturbations.
Cached at: 06/26/26, 12:10 PM
paper: https://t.co/eUoddauH25
Paper page - Confidence-Aware Tool Orchestration for Robust Video Understanding
Source: https://huggingface.co/papers/2606.26904
Abstract
Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework that improves accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning.
Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer 15-30%p accuracy drops on real-world embodied benchmarks, while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by the reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6%p and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8%p above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
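The abstract describes the evidence interface, frame selection, tiering, and reward only in words, so the following Python is a rough sketch of how those pieces could fit together, not the authors' implementation. Every name here (`Evidence`, `select_frames`, `tier`, `confidence_cost_reward`) is hypothetical, as are the product form of the reliability-relevance score, the tier thresholds (0.8 and 0.4), and the reward weights. The reward shown is only a per-rollout scalar; GRPO's group-relative normalization is not included.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Evidence:
    """Shared evidence record; field names are assumptions, not from the paper."""
    tool: str                        # perception tool that produced the evidence
    prediction: Any                  # e.g. bounding box, trajectory, text, action label
    time_span: Tuple[float, float]   # temporal grounding (start, end) in seconds
    reliability: float               # calibrated reliability score in [0, 1]


def select_frames(reliability: List[float], relevance: List[float], k: int) -> List[int]:
    """Return the indices of the k best frames, in temporal order.

    Assumption: the combined score is reliability * relevance. The abstract
    names a "reliability-relevance score" but does not give its formula.
    """
    if len(reliability) != len(relevance):
        raise ValueError("need one reliability and one relevance value per frame")
    scores = [r * q for r, q in zip(reliability, relevance)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def tier(reliability: float, high: float = 0.8, low: float = 0.4) -> str:
    """Map a reliability score to high/medium/low (thresholds are assumed)."""
    if reliability >= high:
        return "high"
    if reliability >= low:
        return "medium"
    return "low"


def group_by_tier(evidence: List[Evidence]) -> Dict[str, List[Evidence]]:
    """Bucket evidence so a reasoner can weight each tier differently."""
    tiers: Dict[str, List[Evidence]] = {"high": [], "medium": [], "low": []}
    for ev in evidence:
        tiers[tier(ev.reliability)].append(ev)
    return tiers


def confidence_cost_reward(correct: bool, used: List[Evidence], n_tool_calls: int,
                           w_conf: float = 0.5, w_cost: float = 0.05) -> float:
    """Hypothetical scalar: correctness + mean reliability bonus - tool-call cost."""
    accuracy_term = 1.0 if correct else 0.0
    mean_rel = sum(ev.reliability for ev in used) / len(used) if used else 0.0
    return accuracy_term + w_conf * mean_rel - w_cost * n_tool_calls
```

In this sketch, a blurred frame with low reliability is dropped before any tool sees it, and low-tier evidence stays visible to the reasoner but can be down-weighted rather than silently trusted.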
Similar Articles
Confidence-Aware Tool Orchestration for Robust Video Understanding
Robust-TO addresses the Blind Trust Problem in video reasoning by integrating per-frame trustworthiness into an agentic framework, improving accuracy under realistic perturbations through calibrated evidence weighting and reliability-aware reasoning.
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?
Robust-U1 is a framework that enables multimodal large language models (MLLMs) to self-recover corrupted visual content using supervised fine-tuning, reinforcement learning with dual rewards, and joint multimodal reasoning, achieving state-of-the-art robustness on corruption benchmarks.
Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security
This survey provides a comprehensive examination of trustworthy agentic AI, focusing on safety, robustness, privacy, and system security. It clarifies key concepts, identifies risks along the agent workflow, summarizes mitigation strategies, and consolidates evaluation metrics and benchmarks, aiming to serve as a practical reference for deploying agentic AI in high-stakes environments.
Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On
This vision paper argues that trust in Agent-to-Agent (A2A) networks must be integrated from the ground up, as existing agent alignment techniques are insufficient to address systemic vulnerabilities like adversarial composition and semantic misalignment.
The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
This paper challenges the assumption that current Vision-Language Models faithfully synthesize multimodal data, proposing an information-theoretic Modality Translation Protocol with new metrics (Toll, Curse, Fallacy of Seeing) to evaluate trustworthiness over traditional multimodal gain.