Are Video Reasoning Models Ready to Go Outside?

Abstract

In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with the robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer drops of up to 35% in accuracy and 28% in reasoning score under realistic perturbations. ROVA effectively mitigates this degradation, improving relative accuracy by at least 24% and reasoning score by over 9% compared with baseline models (Qwen2.5/3-VL, InternVL2.5, Embodied-R). These gains also transfer to clean standard benchmarks, yielding consistent improvements.
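The percentages above are easier to compare once it is clear how a relative change is computed. The abstract does not state whether the reported drops are relative or absolute, so the minimal sketch below assumes the relative form for both drops and gains. The function name `relative_change` and all numeric values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed relative change of `value` with respect to `baseline` (as a fraction)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# Hypothetical accuracies, for illustration only (NOT taken from the paper).
clean_acc = 0.60      # a model evaluated on unperturbed videos
perturbed_acc = 0.42  # the same model evaluated under perturbations
robust_acc = 0.525    # a robustness-trained model under the same perturbations

drop = relative_change(clean_acc, perturbed_acc)   # negative value: a degradation
gain = relative_change(perturbed_acc, robust_acc)  # positive value: an improvement

print(f"relative drop under perturbation: {drop:.0%}")
print(f"relative gain from robust training: {gain:.0%}")
```

Under this reading, a model that falls from 0.60 to 0.42 loses 30% in relative terms but only 18 points in absolute terms, so the two conventions should not be mixed when comparing numbers across benchmarks.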

