Are Video Reasoning Models Ready to Go Outside?

Abstract

Vision-language models deployed in the real world routinely encounter disturbances such as weather, occlusion, and camera motion. Under these conditions their understanding and reasoning degrade substantially, exposing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, which enables adaptive training with the robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer drops of up to 35% in accuracy and up to 28% in reasoning under realistic perturbations. ROVA effectively mitigates this degradation, improving relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains also transfer to clean standard benchmarks, yielding consistent improvements.
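The abstract does not specify how the consistency reward or the difficulty estimate is computed, so the following is only a toy sketch of the general idea, not ROVA's actual formulation. It assumes the reward is the fraction of perturbed-view answers that agree with the clean-view answer, that difficulty is one minus that reward, and that samples of intermediate difficulty are the most informative. The function names, the `target` parameter, and the `eps` floor are all hypothetical.

```python
from typing import List, Sequence


def consistency_reward(clean_answer: str, perturbed_answers: Sequence[str]) -> float:
    """Toy reward (assumption): share of perturbed-view answers matching the clean-view answer."""
    if not perturbed_answers:
        return 0.0
    reference = clean_answer.strip().lower()
    matches = sum(a.strip().lower() == reference for a in perturbed_answers)
    return matches / len(perturbed_answers)


def difficulty(reward: float) -> float:
    """Hypothetical difficulty score: low consistency means a harder sample."""
    return 1.0 - reward


def sampling_weights(
    difficulties: Sequence[float], target: float = 0.5, eps: float = 0.05
) -> List[float]:
    """Normalized sampling weights that peak at `target` difficulty (assumption).

    Samples that are neither trivial nor hopeless for the current model get
    the largest weight; `eps` keeps every sample reachable.
    """
    raw = [eps + 1.0 - abs(d - target) for d in difficulties]
    total = sum(raw)
    return [w / total for w in raw]


if __name__ == "__main__":
    # Re-estimate difficulty from the current model's answers, then reweight.
    rewards = [
        consistency_reward("A", ["A", "A", "A", "A"]),  # fully consistent
        consistency_reward("A", ["a", "B", "A ", "C"]),  # partly consistent
        consistency_reward("A", ["B", "C", "D", "B"]),  # inconsistent
    ]
    weights = sampling_weights([difficulty(r) for r in rewards])
    print(rewards, weights)
```

In an online setting these weights would be recomputed as the model changes, so a sample that was once too hard can become a priority later; how ROVA actually schedules this is not described in the abstract.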

