GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Abstract

Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
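Two mechanisms named in the abstract can be illustrated in isolation: vector quantization, which maps a continuous latent to the index of its nearest codebook entry, and temperature-controlled sampling of the next token. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the toy codebook, and the toy logits are all hypothetical, and the sketch uses plain Python lists where a real model would use learned tensors.

```python
import math
import random


def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical toy data: a 3-entry codebook and logits over 3 tokens.
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_logits = [2.0, 1.0, 0.0]

token_index = quantize([0.9, 0.1], toy_codebook)
next_token = sample_token(toy_logits, temperature=0.7, rng=random.Random(0))
```

In this sketch, lowering `temperature` concentrates probability on the highest-scoring token, which gives more conservative samples, while raising it flattens the distribution and increases diversity.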
