BlenderAlchemy: Editing 3D Graphics with Vision-Language Models

April 26, 2024
Authors: Ian Huang, Guandao Yang, Leonidas Guibas
cs.AI

Abstract

Graphics design is important for many applications, including movie production and game design. To create a high-quality scene, designers usually spend hours in software like Blender, where they may need to interleave and repeat operations, such as connecting material nodes, hundreds of times. Moreover, slightly different design goals may require completely different sequences of operations, making automation difficult. In this paper, we propose a system that leverages Vision-Language Models (VLMs), like GPT-4V, to intelligently search the design action space and arrive at an answer that satisfies the user's intent. Specifically, we design a vision-based edit generator and state evaluator that work together to find the correct sequence of actions to achieve the goal. Inspired by the role of visual imagination in the human design process, we supplement the visual reasoning capabilities of VLMs with "imagined" reference images from image-generation models, providing visual grounding for abstract language descriptions. We provide empirical evidence that our system can produce simple but tedious Blender editing sequences for tasks such as editing procedural materials from text and/or reference images, and adjusting lighting configurations for product renderings in complex scenes.
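To make the generator/evaluator loop in the abstract concrete, here is a minimal Python sketch of that search pattern. All function names and bodies below are hypothetical stubs, not the authors' implementation: the paper uses GPT-4V as the VLM for both roles and Blender's Python API for rendering, while this sketch only illustrates how edit proposal and state evaluation could alternate.

```python
import random  # stand-in randomness for the stubbed VLM calls


def render(script: str) -> bytes:
    """Hypothetical stand-in: execute `script` in Blender and return a render."""
    return script.encode()  # stub; a real version would call Blender's Python API


def propose_edits(script: str, image: bytes, intent: str, n: int) -> list[str]:
    """Hypothetical 'edit generator': a VLM proposes n candidate script edits
    given the current script, its render, and the user's intent."""
    return [script + f"\n# candidate edit {i} toward: {intent}" for i in range(n)]


def pick_best(candidates: list[str], intent: str) -> str:
    """Hypothetical 'state evaluator': a VLM picks the candidate whose render
    best matches the intent (or an 'imagined' reference image)."""
    return random.choice(candidates)  # stub; a real evaluator compares renders


def refine(script: str, intent: str, depth: int = 3, breadth: int = 4) -> str:
    """Alternate edit generation and state evaluation for `depth` rounds,
    keeping the best of `breadth` candidates each round."""
    best = script
    for _ in range(depth):
        candidates = propose_edits(best, render(best), intent, breadth)
        best = pick_best(candidates, intent)
    return best


if __name__ == "__main__":
    print(refine("# initial material node setup", "a brushed-metal look"))
```

The depth/breadth parameters here are illustrative knobs for the search budget; the key design point from the abstract is that generation and evaluation are separate vision-grounded roles rather than a single open-ended VLM call.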
