Robust Video Understanding via Confidence-Aware Tool Orchestration

Abstract

Video reasoning language models implicitly assume that every input frame is equally reliable. This leads to what we term the Blind Trust Problem: under realistic perturbations such as motion blur, glare, or occlusion, frontier video reasoning models can suffer accuracy drops of 15 to 30 percentage points on real-world embodied benchmarks while remaining unaware that their visual evidence has degraded. To address this challenge, we propose Robust-TO, an agentic video understanding framework that explicitly integrates per-frame trustworthiness into every stage of reasoning. Robust-TO organizes heterogeneous visual perception tools under a unified evidence interface. Each tool receives a sub-query derived from the original question and a set of trustworthy frames selected by a reliability-relevance score. It returns evidence in a shared format: a concrete prediction (e.g., a bounding box, motion trajectory, recognized text, or action label), temporal grounding, and a calibrated reliability score. During reasoning, these calibrated scores guide evidence weighting in a three-tier synthesis process (high/medium/low) and define a confidence-cost GRPO reward that jointly optimizes correctness, evidence reliability, and efficiency. On two video reasoning benchmarks spanning eight tasks, Robust-TO achieves 56.4% average accuracy on clean inputs, surpassing the strongest open-source baseline by 10.6 percentage points and outperforming Gemini-2.5-Pro (46.2%). Under five realistic corruption types, Robust-TO maintains 54.3% average accuracy, 5.8 percentage points above the strongest open-source baseline, and exhibits the smallest clean-to-corrupted accuracy drop among all compared methods.
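To make the shared evidence format and the three-tier weighting concrete, the following Python sketch illustrates one way the pieces could fit together. It is not the Robust-TO implementation: the `Evidence` fields mirror the abstract, but the tier thresholds (`HIGH_T`, `LOW_T`), the tier weights, the product form of the reliability-relevance score, and the reward coefficients `lam` and `mu` are all hypothetical values chosen for illustration. The `reward` function computes only a scalar per-rollout reward; the group-relative advantage computation of GRPO is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Evidence:
    """Shared evidence record returned by every tool (fields follow the abstract)."""
    tool: str                   # name of the perception tool
    prediction: str             # concrete prediction, e.g. an action label
    span: Tuple[float, float]   # temporal grounding (start, end) in seconds
    reliability: float          # calibrated reliability score in [0, 1]


# Hypothetical thresholds and weights; the abstract does not specify them.
HIGH_T = 0.7
LOW_T = 0.4
TIER_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.0}


def select_frames(reliability: Sequence[float],
                  relevance: Sequence[float], k: int) -> List[int]:
    """Pick the k frames with the highest reliability * relevance product.

    The product form is an assumption; indices are returned in temporal order.
    """
    scores = [r * q for r, q in zip(reliability, relevance)]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def tier(e: Evidence) -> str:
    """Map a calibrated reliability score to a high/medium/low tier."""
    if e.reliability >= HIGH_T:
        return "high"
    if e.reliability >= LOW_T:
        return "medium"
    return "low"


def synthesize(evidence: Sequence[Evidence]) -> Optional[str]:
    """Tier-weighted vote over predictions; low-tier evidence is ignored."""
    votes: Dict[str, float] = {}
    for e in evidence:
        weight = TIER_WEIGHT[tier(e)] * e.reliability
        if weight > 0:
            votes[e.prediction] = votes.get(e.prediction, 0.0) + weight
    return max(votes, key=votes.get) if votes else None


def reward(correct: bool, evidence: Sequence[Evidence], n_calls: int,
           lam: float = 0.5, mu: float = 0.05) -> float:
    """Scalar reward: correctness + lam * mean reliability - mu * tool calls."""
    mean_rel = (sum(e.reliability for e in evidence) / len(evidence)
                if evidence else 0.0)
    return float(correct) + lam * mean_rel - mu * n_calls
```

In this sketch, a low-reliability tool output contributes nothing to the final answer, while the per-call penalty `mu` discourages invoking tools whose evidence would not change the outcome.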