IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
June 29, 2025
Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
cs.AI
Abstract
Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
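
To make the "understanding-by-creating" setup concrete, below is a minimal sketch of what one agentic inverse-rendering step could look like: a VLM is prompted to emit a structured scene description for an input image, which is then handed to a renderer and compared against the original. The `query_vlm` helper, the JSON scene schema, and the Blender invocation are illustrative assumptions, not the actual IR3D-Bench tooling or prompts.

```python
# Sketch of an agentic inverse-rendering step (assumed interfaces, not the
# official IR3D-Bench pipeline).
import json
import subprocess
from pathlib import Path

SYSTEM_PROMPT = (
    "You are given an image of a 3D scene. Return a JSON list of objects, "
    "each with 'shape', 'size', 'color', 'material', and 'position' [x, y, z], "
    "such that rendering the list reproduces the input image."
)

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a vision-language model API of your choice."""
    raise NotImplementedError("Plug in a VLM client here.")

def render_scene(scene: list, out_path: str) -> None:
    """Placeholder: hand the structured scene to a renderer.

    Here we assume a hypothetical Blender helper script `render_scene.py`
    that reads the JSON scene and writes an image to `out_path`.
    """
    Path("scene.json").write_text(json.dumps(scene, indent=2))
    subprocess.run(
        ["blender", "--background", "--python", "render_scene.py",
         "--", "scene.json", out_path],
        check=True,
    )

def invert_image(image_path: str) -> list:
    """One agentic step: observe the image, emit a scene program, render it back."""
    reply = query_vlm(image_path, SYSTEM_PROMPT)
    scene = json.loads(reply)               # the VLA's structured 3D hypothesis
    render_scene(scene, "recreation.png")   # "understanding by creating"
    return scene
```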
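Similarly, a geometric-accuracy score can be illustrated by matching predicted objects to ground-truth objects and averaging their positional error. The sketch below uses Hungarian matching over object centers; the metrics actually used in IR3D-Bench may be defined differently, and this only conveys the general idea.

```python
# Illustrative geometric-accuracy metric: optimal one-to-one matching of
# predicted vs. ground-truth object centers, then mean Euclidean distance.
import numpy as np
from scipy.optimize import linear_sum_assignment

def geometric_error(pred_positions: np.ndarray, gt_positions: np.ndarray) -> float:
    """Mean distance between optimally matched predicted/ground-truth centers.

    pred_positions: (N, 3) array of predicted object centers.
    gt_positions:   (M, 3) array of ground-truth object centers.
    Unmatched objects (when N != M) are ignored here for simplicity.
    """
    # Pairwise distances between every predicted and ground-truth center.
    dists = np.linalg.norm(
        pred_positions[:, None, :] - gt_positions[None, :, :], axis=-1
    )
    row_idx, col_idx = linear_sum_assignment(dists)  # minimal-cost matching
    return float(dists[row_idx, col_idx].mean())

# Example: two predicted objects vs. two ground-truth objects.
pred = np.array([[0.1, 0.0, 0.5], [2.0, 1.0, 0.5]])
gt = np.array([[2.1, 1.0, 0.5], [0.0, 0.0, 0.5]])
print(geometric_error(pred, gt))  # ~0.1
```

Attribute-level scores (color, material, shape) and relational checks (left-of, behind, etc.) could be layered on top of the same matching, but those definitions are left to the benchmark's released evaluation protocol.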