BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing

Abstract

3D graphics editing is crucial in applications like movie production and game design, yet it remains a time-consuming process that demands highly specialized domain expertise. Automating this process is challenging because graphics editing involves a variety of tasks, each calling for a distinct skill set. Recently, vision-language models (VLMs) have emerged as a powerful framework for automating the editing process, but their development and evaluation are bottlenecked by the lack of a comprehensive benchmark that requires human-level perception and presents real-world editing complexity. In this work, we present BlenderGym, the first comprehensive VLM system benchmark for 3D graphics editing. BlenderGym evaluates VLM systems through code-based 3D reconstruction tasks. We evaluate closed- and open-source VLM systems and observe that even state-of-the-art VLM systems struggle with tasks that are relatively easy for human Blender users. Enabled by BlenderGym, we study how inference scaling techniques affect VLM performance on graphics editing tasks. Notably, our findings reveal that the verifier used to guide the scaling of generation can itself be improved through inference scaling, complementing recent insights on inference scaling of LLM generation in coding and math tasks. We further show that inference compute is not uniformly effective and can be optimized by strategically distributing it between generation and verification.
