RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

Abstract

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real-world conditions is essential. However, existing benchmarks assess VLMs on clean images or isolated perturbations rather than on stresses that arise from physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some of the perturbations it uses rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design lets RoboStressBench cover a broad range of real-world visual stresses while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and show that different physical factors degrade different embodied capabilities, a pattern that aggregate accuracy often obscures. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
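To make concrete the point that aggregate accuracy can hide stress-specific failures, the following is a minimal sketch of a per-dimension breakdown. It is not the benchmark's evaluation code: the record layout, the `RECORDS` values, and the helper names `accuracy_by` and `aggregate_accuracy` are assumptions introduced here for illustration, and the toy numbers are invented rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical toy records: (stress dimension, capability, answered correctly).
# Dimensions follow the abstract: M (Material), V (Viewpoint), L (Lighting), G (Geometry).
# These values are invented for illustration and are NOT results from the paper.
RECORDS = [
    ("M", "recognition", True),
    ("M", "planning", True),
    ("V", "recognition", True),
    ("V", "reasoning", True),
    ("L", "recognition", False),
    ("L", "planning", False),
    ("G", "reasoning", True),
    ("G", "planning", True),
]


def aggregate_accuracy(records):
    """Fraction of all records answered correctly."""
    return sum(int(correct) for _, _, correct in records) / len(records)


def accuracy_by(records, key_index):
    """Return {group: accuracy}, grouping by the field at key_index
    (0 = stress dimension, 1 = capability)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key_index]
        totals[group] += 1
        hits[group] += int(rec[2])
    return {group: hits[group] / totals[group] for group in totals}


if __name__ == "__main__":
    print("aggregate:", aggregate_accuracy(RECORDS))
    print("by stress dimension:", accuracy_by(RECORDS, 0))
    print("by capability:", accuracy_by(RECORDS, 1))
```

In this toy data the aggregate score looks moderate, while the breakdown shows that every failure falls under the Lighting dimension, which is the kind of stress-specific pattern a single aggregate number would not reveal.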