Are Video Reasoning Models Ready to Go Outside?

March 11, 2026
Authors: Yangfan He, Changgyu Boo, Jaehong Yoon
cs.AI

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer drops of up to 35% in accuracy and 28% in reasoning under realistic perturbations. Compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R), ROVA effectively mitigates this degradation, improving relative accuracy by at least 24% and reasoning by over 9%. These gains transfer to clean standard benchmarks, yielding consistent improvements.
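
The abstract names two ingredients, a consistency reward under perturbation and difficulty-aware sample selection, without giving their exact form. The Python sketch below is only a toy illustration of how such pieces could fit together, and it is not the authors' implementation. Every function name, the exact-match reward, the moving-average difficulty update, and the choice to favor samples of intermediate difficulty are assumptions made for illustration.

```python
import random

# Toy illustration only: all names and formulas here are assumptions,
# not the ROVA method as specified by its authors.

def consistency_reward(clean_answer, perturbed_answers):
    """Fraction of perturbed-view answers that match the clean-view answer."""
    if not perturbed_answers:
        return 0.0
    matches = sum(a == clean_answer for a in perturbed_answers)
    return matches / len(perturbed_answers)

def update_difficulty(old, reward, momentum=0.8):
    """Moving average of (1 - reward); values near 1 mean a harder sample."""
    return momentum * old + (1.0 - momentum) * (1.0 - reward)

def sampling_weights(difficulties, eps=1e-3):
    """Assumed heuristic: weight peaks at difficulty 0.5, so samples that are
    neither trivially easy nor currently hopeless are drawn most often."""
    return [d * (1.0 - d) + eps for d in difficulties]

def pick_batch(difficulties, batch_size, rng):
    """Draw sample indices (with replacement) in proportion to their weights."""
    weights = sampling_weights(difficulties)
    return rng.choices(range(len(difficulties)), weights=weights, k=batch_size)

if __name__ == "__main__":
    rng = random.Random(0)
    difficulties = [0.05, 0.5, 0.95]
    batch = pick_batch(difficulties, batch_size=4, rng=rng)
    reward = consistency_reward("left", ["left", "right", "left", "left"])
    difficulties[batch[0]] = update_difficulty(difficulties[batch[0]], reward)
    print(batch, reward, difficulties)
```

In this sketch the difficulty estimate is refreshed each time a sample is visited, which loosely mirrors the abstract's description of continuously re-estimating difficulty as the model's capability evolves.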