IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Abstract

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
