RoboStressBench: A Benchmark for the Robustness of VLMs to Physical Visual Stress in Embodied Scenes

Abstract

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real-world conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses that arise from physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some of its perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of real-world visual stresses while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, differences that aggregate accuracy often obscures. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
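
For orientation, the standard rendering equation from computer graphics is reproduced below. The abstract does not say how RoboStressBench maps its four dimensions onto the equation, so the correspondence noted in the comments is an assumed reading, not the benchmark's own definition.

```latex
% Standard rendering equation: outgoing radiance at surface point x in direction \omega_o.
% Assumed (not source-confirmed) correspondence to the four stress dimensions:
%   Material  (M): the BRDF f_r
%   Viewpoint (V): the outgoing direction \omega_o
%   Lighting  (L): the incident radiance L_i (and emission L_e)
%   Geometry  (G): the surface point x and its normal n
L_o(\mathbf{x}, \omega_o)
  = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\,
    L_i(\mathbf{x}, \omega_i)\,
    (\omega_i \cdot \mathbf{n})\, \mathrm{d}\omega_i
```

Under this reading, each stress dimension perturbs one factor of the image formation process while the others are held fixed, which is what would make a controlled, per-dimension analysis possible.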