IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
June 29, 2025
Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
cs.AI
Abstract
Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
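
To make the "understanding-by-creating" setup concrete, below is a minimal sketch of what one agentic inverse-rendering step could look like: a VLM is prompted to emit a structured scene description for an input image, which is then handed to a renderer and compared against the original. The `query_vlm` helper, the JSON scene schema, and the Blender invocation are illustrative assumptions, not the actual IR3D-Bench tooling or prompts.

```python
# Sketch of an agentic inverse-rendering step (assumed interfaces, not the
# official IR3D-Bench pipeline).
import json
import subprocess
from pathlib import Path

SYSTEM_PROMPT = (
    "You are given an image of a 3D scene. Return a JSON list of objects, "
    "each with 'shape', 'size', 'color', 'material', and 'position' [x, y, z], "
    "such that rendering the list reproduces the input image."
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a vision-language model API of your choice."""
    raise NotImplementedError("Plug in a VLM client here.")

def render_scene(scene: list, out_path: str) -> None:
    """Placeholder: hand the structured scene to a renderer.

    Here we assume a hypothetical Blender helper script `render_scene.py`
    that reads the JSON scene and writes an image to `out_path`.
    """
    Path("scene.json").write_text(json.dumps(scene, indent=2))
    subprocess.run(
        ["blender", "--background", "--python", "render_scene.py",
         "--", "scene.json", out_path],
        check=True,
    )

def invert_image(image_path: str) -> list:
    """One agentic step: observe the image, emit a scene program, render it back."""
    reply = query_vlm(image_path, SYSTEM_PROMPT)
    scene = json.loads(reply)               # the VLA's structured 3D hypothesis
    render_scene(scene, "recreation.png")   # "understanding by creating"
    return scene
```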
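Similarly, a geometric-accuracy score can be illustrated by matching predicted objects to ground-truth objects and averaging their positional error. The sketch below uses Hungarian matching over object centers; the metrics actually used in IR3D-Bench may be defined differently, and this only conveys the general idea.

```python
# Illustrative geometric-accuracy metric: optimal one-to-one matching of
# predicted vs. ground-truth object centers, then mean Euclidean distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

def geometric_error(pred_positions: np.ndarray, gt_positions: np.ndarray) -> float:
    """Mean distance between optimally matched predicted/ground-truth centers.

    pred_positions: (N, 3) array of predicted object centers.
    gt_positions:   (M, 3) array of ground-truth object centers.
    Unmatched objects (when N != M) are ignored here for simplicity.
    """
    # Pairwise distances between every predicted and ground-truth center.
    dists = np.linalg.norm(
        pred_positions[:, None, :] - gt_positions[None, :, :], axis=-1
    )
    row_idx, col_idx = linear_sum_assignment(dists)  # minimal-cost matching
    return float(dists[row_idx, col_idx].mean())

# Example: two predicted objects vs. two ground-truth objects.
pred = np.array([[0.1, 0.0, 0.5], [2.0, 1.0, 0.5]])
gt = np.array([[2.1, 1.0, 0.5], [0.0, 0.0, 0.5]])
print(geometric_error(pred, gt))  # ~0.1
```

Attribute-level scores (color, material, shape) and relational checks (left-of, behind, etc.) could be layered on top of the same matching, but those definitions are left to the benchmark's released evaluation protocol.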