RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Abstract

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real-world conditions is essential. However, existing benchmarks evaluate VLMs on clean images or isolated perturbations rather than on stresses that arise from physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some of its perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This decomposition lets RoboStressBench cover a broad range of real-world visual stresses while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and show that different physical factors degrade different embodied capabilities, effects that aggregate accuracy often obscures. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
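
The abstract describes the stress-aware solver only at a high level: detect the visual stressors, invoke a visual-editing skill, then reason. The sketch below is one possible reading of that control flow and is not the authors' implementation. The class name `StressAwareSolver`, the callables `detect`, `skills`, and `reason`, the toy image type, and the brightness threshold are all hypothetical stand-ins introduced for illustration. Only the dimension labels M, V, L, and G come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Placeholder image type (assumption): rows of grayscale pixels in [0, 1].
Image = List[List[float]]


@dataclass
class StressAwareSolver:
    """Hypothetical detect -> edit -> reason pipeline (not the paper's code)."""
    detect: Callable[[Image], List[str]]          # returns labels among "M", "V", "L", "G"
    skills: Dict[str, Callable[[Image], Image]]   # one editing skill per stress dimension
    reason: Callable[[Image, str], str]           # stand-in for the downstream VLM call

    def answer(self, image: Image, question: str) -> str:
        # Apply the registered skill for each detected stressor before reasoning.
        for dim in self.detect(image):
            skill = self.skills.get(dim)
            if skill is not None:  # stressors without a registered skill are skipped
                image = skill(image)
        return self.reason(image, question)


def mean_brightness(image: Image) -> float:
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def toy_detect(image: Image) -> List[str]:
    # Toy detector: flag a Lighting (L) stressor when the image is very dark.
    # The 0.2 threshold is arbitrary.
    return ["L"] if mean_brightness(image) < 0.2 else []


def toy_brighten(image: Image) -> Image:
    # Toy "visual-editing skill": scale pixel values by 4, clamped to 1.0.
    return [[min(1.0, p * 4.0) for p in row] for row in image]


def toy_reason(image: Image, question: str) -> str:
    # Toy stand-in for a VLM: it can only "see" sufficiently bright images.
    return "visible" if mean_brightness(image) >= 0.2 else "too dark"


solver = StressAwareSolver(
    detect=toy_detect,
    skills={"L": toy_brighten},
    reason=toy_reason,
)

dark = [[0.1, 0.1], [0.1, 0.1]]
print(toy_reason(dark, "What is on the table?"))     # baseline, no editing
print(solver.answer(dark, "What is on the table?"))  # edit first, then reason
```

Keeping detection, editing, and reasoning as separate callables means each stage can be swapped or ablated independently, which matches the per-dimension analysis the benchmark is built around. A real system would replace the toy functions with a stressor classifier, image-editing tools, and a VLM query.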