RoboStressBench: Evaluating the Robustness of Vision-Language Models to Physical Visual Stress in Embodied Scenes

Abstract

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real-world conditions is essential. However, existing benchmarks evaluate VLMs on clean images or isolated perturbations rather than on stresses that arise from physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some of its perturbations rarely occur in realistic embodied scenes. This gap raises a fundamental question: how can visual stress be defined in a principled way that captures the diverse factors encountered in physical environments? To address this question, we model visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the rendering equation used in physically based rendering, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This decomposition lets RoboStressBench cover a broad range of real-world visual stresses while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluation of state-of-the-art VLMs, we identify stress-specific failure modes and show that different physical factors degrade different embodied capabilities, a pattern that aggregate accuracy often obscures. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
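
For reference, the rendering equation that the abstract alludes to is commonly written in the form below (the standard formulation from the computer graphics literature). The abstract does not state the equation itself, so the symbols here follow the usual textbook convention and are not taken from the paper.

```latex
% Standard rendering equation (textbook notation, not quoted from the paper).
%   L_o : outgoing radiance at surface point x in direction \omega_o
%   L_e : radiance emitted by the surface itself
%   f_r : bidirectional reflectance distribution function (BRDF)
%   L_i : incoming radiance from direction \omega_i
%   n   : surface normal at x;  \Omega : hemisphere around n
\[
L_o(\mathbf{x}, \omega_o)
  = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    L_i(\mathbf{x}, \omega_i)\,
    (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
\]
```

One plausible reading, offered here only as an assumption and not as the authors' stated mapping, is that the BRDF f_r corresponds to Material, the outgoing direction \omega_o to Viewpoint, the incoming radiance L_i to Lighting, and the surface point x with its normal n to Geometry.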