IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

June 29, 2025
Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
cs.AI

Abstract

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.