BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing
April 2, 2025
Authors: Yunqi Gu, Ian Huang, Jihyeon Je, Guandao Yang, Leonidas Guibas
cs.AI
Abstract
3D graphics editing is crucial in applications like movie production and game
design, yet it remains a time-consuming process that demands highly specialized
domain expertise. Automating this process is challenging because graphical
editing requires performing a variety of tasks, each requiring distinct skill
sets. Recently, vision-language models (VLMs) have emerged as a powerful
framework for automating the editing process, but their development and
evaluation are bottlenecked by the lack of a comprehensive benchmark that
requires human-level perception and presents real-world editing complexity. In
this work, we present BlenderGym, the first comprehensive VLM system benchmark
for 3D graphics editing. BlenderGym evaluates VLM systems through code-based 3D
reconstruction tasks. We evaluate closed- and open-source VLM systems and
observe that even the state-of-the-art VLM system struggles with tasks
relatively easy for human Blender users. Enabled by BlenderGym, we study how
inference scaling techniques impact VLM performance on graphics editing
tasks. Notably, our findings reveal that the verifier used to guide the scaling
of generation can itself be improved through inference scaling, complementing
recent insights on inference scaling of LLM generation in coding and math
tasks. We further show that inference compute is not uniformly effective and
can be optimized by strategically distributing it between generation and
verification.