Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

Abstract

Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to the 3D domain and pursue stronger 3D shape generation by improving both their capacity and their scalability. First, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. The resulting collection comprises approximately 900,000 objects with multiple properties: meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, allows our model to learn from a wide range of object variations. However, directly applying auto-regression in 3D faces two critical challenges: the high computational demands of volumetric grids and the ambiguous auto-regressive order along grid dimensions, both of which degrade the quality of the generated 3D shapes. To address these challenges in terms of capacity, we present a novel framework, Argus3D. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order. Conditional generation can thus be realized by simply concatenating various conditioning inputs, such as point clouds, categories, images, and text, to the latent vector. In addition, thanks to the simplicity of our model architecture, we naturally scale our approach up to a larger model with 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories and achieves remarkable performance.
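
The two ideas summarized above, discretizing a latent vector and generating its tokens one at a time with conditioning inputs concatenated in front, can be illustrated with a minimal toy sketch. It is not the Argus3D implementation: the function names, the two-entry codebook, and the random stand-in for the network are all hypothetical, and a real system would use a learned codebook and a trained transformer. The sketch only shows the factorization p(z | c) = prod_i p(z_i | z_<i, c) over a 1D token sequence.

```python
import math
import random


def quantize(latent, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in latent
    ]


def toy_next_token_logits(prefix, vocab_size):
    """Hypothetical stand-in for a trained model: logits depend only on the prefix."""
    seed = sum((i + 1) * t for i, t in enumerate(prefix))
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]


def sample_tokens(condition_tokens, length, vocab_size, rng):
    """Sample z_1..z_L sequentially; the condition is concatenated as a prefix."""
    sequence = list(condition_tokens)
    for _ in range(length):
        logits = toy_next_token_logits(sequence, vocab_size)
        weights = [math.exp(value) for value in logits]  # unnormalized softmax
        token = rng.choices(range(vocab_size), weights=weights, k=1)[0]
        sequence.append(token)
    # Return only the generated shape tokens, without the condition prefix.
    return sequence[len(condition_tokens):]


# Toy usage: a made-up 2-entry codebook and a single made-up condition token.
codebook = [[0.0, 0.0], [1.0, 1.0]]
indices = quantize([[0.1, -0.1], [0.9, 1.2]], codebook)
tokens = sample_tokens([3], length=8, vocab_size=4, rng=random.Random(0))
```

Because the sequence is one-dimensional, the generation order is simply left to right, which is the sense in which a latent-vector representation avoids the ambiguous ordering of a 3D grid. Changing the condition prefix changes every subsequent conditional distribution, which is how concatenation alone can steer generation in this toy setting.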
