BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing

Abstract

3D graphics editing is crucial in applications such as movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating this process is challenging because graphics editing involves a variety of tasks, each requiring a distinct skill set. Recently, vision-language models (VLMs) have emerged as a powerful framework for automating the editing process, but their development and evaluation are bottlenecked by the lack of a comprehensive benchmark that requires human-level perception and presents real-world editing complexity. In this work, we present BlenderGym, the first comprehensive VLM system benchmark for 3D graphics editing. BlenderGym evaluates VLM systems through code-based 3D reconstruction tasks. We evaluate closed- and open-source VLM systems and observe that even the state-of-the-art VLM system struggles with tasks that are relatively easy for human Blender users. Enabled by BlenderGym, we study how inference scaling techniques affect VLM performance on graphics editing tasks. Notably, our findings reveal that the verifier used to guide the scaling of generation can itself be improved through inference scaling, complementing recent insights on inference scaling of LLM generation in coding and math tasks. We further show that inference compute is not uniformly effective and can be optimized by strategically distributing it between generation and verification.
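
The idea of dividing inference compute between generation and verification can be illustrated with a minimal sketch. This is not BlenderGym's implementation: the names `split_budget`, `best_of_n`, `generate`, `verify`, and `generation_share` are all hypothetical, and the scheme shown (best-of-N sampling in which each candidate's score is the average of several verifier calls) is only one simple way to spend a fixed budget on both sides.

```python
from statistics import mean
from typing import Callable, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def split_budget(total_calls: int, generation_share: float) -> Tuple[int, int]:
    """Split a call budget into (candidates to generate, verifier calls per candidate).

    Hypothetical helper: the candidate count is capped at half the budget so
    that every candidate receives at least one verifier call.
    """
    if total_calls < 2:
        raise ValueError("need at least one generation and one verification call")
    wanted = round(total_calls * generation_share)
    n_candidates = max(1, min(total_calls // 2, wanted))
    votes = (total_calls - n_candidates) // n_candidates
    return n_candidates, votes


def best_of_n(
    generate: Callable[[], Candidate],
    verify: Callable[[Candidate], float],
    total_calls: int,
    generation_share: float,
) -> Candidate:
    """Generate candidates, score each by averaged verifier calls, return the best.

    `generate` and `verify` are stand-ins (assumptions) for a model proposing
    an edit and a model scoring it; higher scores mean better candidates.
    """
    n_candidates, votes = split_budget(total_calls, generation_share)
    candidates = [generate() for _ in range(n_candidates)]
    # Averaging several verifier calls reduces the noise of a single judgment.
    scores = [mean(verify(c) for _ in range(votes)) for c in candidates]
    best_index = max(range(n_candidates), key=scores.__getitem__)
    return candidates[best_index]


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Toy setup: a candidate is its own true quality; the verifier is noisy.
    pick = best_of_n(
        generate=lambda: rng.random(),
        verify=lambda c: c + rng.gauss(0.0, 0.3),
        total_calls=24,
        generation_share=0.25,
    )
    print(f"selected candidate quality: {pick:.3f}")
```

In this toy setup, raising `generation_share` yields more candidates but fewer verifier calls per candidate, so selection becomes noisier. Lowering it does the reverse. Where the best balance lies is an empirical question that depends on the generator and verifier being used.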
