L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects

Abstract

Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet these models are not robust at reasoning precisely about the physical and spatial configurations of objects, especially when given unconventional, and therefore out-of-distribution, descriptions such as "a chair with five legs". In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D mesh generation of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial and error within a 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender in which language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D image and text-to-3D models based on human evaluation.
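To make the trial-and-error idea concrete, the following is a minimal sketch of composing "a chair with five legs" from atomic blocks with environment feedback. It is an illustration under stated assumptions, not the L3GO implementation or the SimpleBlenv API: `ToyEnv`, `Box`, `add_box`, `propose_leg`, and `build_chair` are hypothetical names, the environment is a plain-Python stand-in for Blender, and a scripted proposer takes the place of the large language model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical atomic building block: an axis-aligned box (units: cm)."""
    name: str
    center: tuple  # (x, y, z)
    size: tuple    # (dx, dy, dz)


def overlaps(a: Box, b: Box) -> bool:
    # Two axis-aligned boxes intersect only if their extents overlap on every axis.
    # Boxes that merely touch (equal distance) do not count as overlapping.
    return all(
        abs(a.center[i] - b.center[i]) < (a.size[i] + b.size[i]) / 2
        for i in range(3)
    )


class ToyEnv:
    """Hypothetical stand-in for a 3D simulation environment."""

    def __init__(self):
        self.parts = []

    def add_box(self, box: Box):
        # Reject a placement that collides with an existing part and say why.
        for other in self.parts:
            if overlaps(box, other):
                return False, f"{box.name} overlaps {other.name}"
        self.parts.append(box)
        return True, "ok"


def propose_leg(index: int, n_legs: int, radius: float) -> Box:
    # Scripted proposer standing in for the language model:
    # spread legs evenly on a circle of the given radius under the seat.
    angle = 2 * math.pi * index / n_legs
    center = (radius * math.cos(angle), radius * math.sin(angle), 25)
    return Box(f"leg_{index}", center, (10, 10, 50))


def build_chair(n_legs: int = 5, max_retries: int = 5):
    env = ToyEnv()
    log = []
    # The seat rests exactly on top of the legs (legs span z = 0..50).
    env.add_box(Box("seat", (0, 0, 55), (100, 100, 10)))
    radius = 5.0  # deliberately poor first guess so that early proposals collide
    for i in range(n_legs):
        for _ in range(max_retries):
            ok, feedback = env.add_box(propose_leg(i, n_legs, radius))
            log.append((ok, feedback))
            if ok:
                break
            radius += 10.0  # revise the proposal in response to the feedback
        else:
            raise RuntimeError(f"could not place leg_{i}")
    return env, log


if __name__ == "__main__":
    env, log = build_chair()
    for ok, feedback in log:
        print("accepted" if ok else "rejected", "-", feedback)
```

Running `build_chair()` leaves the seat and five legs in `env.parts`, and `log` records each accepted or rejected proposal. In this toy version the revision rule is a fixed radius increase; in an agent setting the feedback string would instead be returned to the model so that it can choose the next action.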

