IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

June 29, 2025
Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
cs.AI

Abstract

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
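
To make the evaluation idea concrete, here is a minimal sketch of how a reconstructed scene might be scored against a ground-truth scene on geometry and appearance. It is an illustration only, not IR3D-Bench's actual metric suite or data format: the `SceneObject` record, the `score_reconstruction` function, name-based object matching, and the two metrics (mean position error and color accuracy) are all assumptions introduced here.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """Hypothetical object record; not the benchmark's real schema."""
    name: str
    position: Tuple[float, float, float]
    color: str


def score_reconstruction(
    predicted: List[SceneObject], ground_truth: List[SceneObject]
) -> Dict[str, float]:
    """Compare a predicted scene with the ground truth.

    Objects are matched by name (an assumption made for simplicity).
    Returns the mean Euclidean position error and the fraction of
    matched objects whose color agrees.
    """
    gt_by_name = {obj.name: obj for obj in ground_truth}
    matched = [(p, gt_by_name[p.name]) for p in predicted if p.name in gt_by_name]
    if not matched:
        return {"position_error": float("inf"), "color_accuracy": 0.0}

    position_error = sum(
        math.dist(p.position, g.position) for p, g in matched
    ) / len(matched)
    color_accuracy = sum(p.color == g.color for p, g in matched) / len(matched)
    return {"position_error": position_error, "color_accuracy": color_accuracy}


# Toy example: the cube is reconstructed exactly; the sphere is one unit
# off along z and has the wrong color.
ground_truth = [
    SceneObject("cube", (0.0, 0.0, 0.0), "red"),
    SceneObject("sphere", (1.0, 0.0, 0.0), "blue"),
]
predicted = [
    SceneObject("cube", (0.0, 0.0, 0.0), "red"),
    SceneObject("sphere", (1.0, 0.0, 1.0), "green"),
]
result = score_reconstruction(predicted, ground_truth)
print(result)
```

In an agentic setting, the predicted scene would come from a program the VLA writes and renders; the sketch covers only the comparison step and leaves out spatial relations and overall plausibility, which the abstract also lists.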