Confidence-Aware Tool Orchestration for Robust Video Understanding

Abstract

Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can lose 15 to 30 percentage points of accuracy on real-world embodied benchmarks while remaining unaware that their visual evidence has been degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by a reliability-relevance score. Each tool returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier (high/medium/low) synthesis process and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6 percentage points and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8 percentage points above the strongest open-source baseline, while exhibiting the smallest clean-to-corrupted accuracy drop among all compared methods.
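To make the unified evidence interface concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. The abstract does not specify how reliability and relevance are combined, where the tier boundaries lie, or the exact form of the reward, so the product scoring in `select_frames`, the thresholds in `tier`, and the weights `alpha` and `beta` in `reward` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Evidence:
    """Shared evidence format returned by every perception tool."""
    tool: str                        # name of the tool that produced the evidence
    prediction: Any                  # e.g. bounding box, trajectory, text, action label
    time_span: Tuple[float, float]   # temporal grounding (start, end) in seconds
    reliability: float               # calibrated reliability score in [0, 1]


def select_frames(reliability: List[float], relevance: List[float], k: int) -> List[int]:
    """Return indices of the k frames with the highest combined score.

    Assumption: the reliability-relevance score is the product of the two
    per-frame scores; the abstract does not state the actual combination.
    """
    scores = [r * q for r, q in zip(reliability, relevance)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore temporal order


def tier(reliability: float, high: float = 0.7, low: float = 0.4) -> str:
    """Map a calibrated score to a tier. Thresholds are hypothetical."""
    if reliability >= high:
        return "high"
    if reliability >= low:
        return "medium"
    return "low"


def reward(correct: bool, evidence: List[Evidence], num_tool_calls: int,
           alpha: float = 0.5, beta: float = 0.1) -> float:
    """Scalar reward trading off correctness, evidence reliability, and cost.

    Assumption: a linear combination with illustrative weights alpha and beta.
    The group-relative normalization used by GRPO is not shown here.
    """
    mean_rel = sum(e.reliability for e in evidence) / len(evidence) if evidence else 0.0
    return float(correct) + alpha * mean_rel - beta * num_tool_calls
```

In this sketch, a frame that is sharp but irrelevant to the sub-query scores as poorly as a relevant but heavily blurred one, which is one simple way to read "reliability-relevance" selection. The reward rises with answer correctness and average evidence reliability and falls with the number of tool calls.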