GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Abstract

Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, enabling full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled with a causal transformer that uses 3D rotary positional embeddings, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods, which refine scenes holistically, our formulation constructs scenes step by step and therefore naturally supports completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
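As a rough illustration of the step-by-step generation described above, the following minimal sketch shows temperature-controlled next-token sampling over a discrete vocabulary, with a prefix standing in for the tokens of an existing partial scene (completion or outpainting) and a token budget standing in for the generation horizon. It is not the authors' implementation: the names `sample_with_temperature`, `generate`, and `next_logits` are hypothetical, and `next_logits` is a placeholder for whatever model maps the tokens so far to logits over the codebook.

```python
import math
import random


def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Draw one token index from unnormalized logits scaled by a temperature.

    Lower temperatures concentrate probability on the highest-scoring token;
    higher temperatures flatten the distribution and give more varied samples.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, prefix, num_new_tokens, temperature=1.0, rng=random):
    """Extend `prefix` one token at a time (hypothetical helper).

    `next_logits` is an assumed callable: it takes the list of tokens so far
    and returns one logit per vocabulary entry for the next position.
    """
    tokens = list(prefix)
    for _ in range(num_new_tokens):
        logits = next_logits(tokens)
        tokens.append(sample_with_temperature(logits, temperature, rng))
    return tokens
```

Under these assumptions, completion and outpainting amount to choosing which existing tokens go into `prefix`, and the generation horizon is simply `num_new_tokens`.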
