IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Abstract

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark that challenges VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering with various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than in basic tool usage. We release IR3D-Bench, including data and evaluation protocols, to facilitate systematic study and development of tool-using VLAs toward genuine scene understanding through creation.