Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability

Abstract

Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to the 3D domain and pursue stronger 3D shape generation by improving them in both capacity and scalability. First, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. The ensemble comprises approximately 900,000 objects, each with multiple properties: meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, lets our model learn from a wide range of object variations. However, directly applying auto-regression in 3D faces two critical challenges: high computational demands on volumetric grids, and an ambiguous auto-regressive order along the grid dimensions. Both degrade the quality of the generated 3D shapes. To address the capacity side, we present a novel framework, Argus3D. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational cost but also preserves essential geometric details by learning the joint distribution in a more tractable order. Conditional generation can then be realized by simply concatenating various conditioning inputs, such as point clouds, categories, images, and texts, to the latent vector. In addition, thanks to the simplicity of the model architecture, our approach scales naturally to a larger model with 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving remarkable performance.
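The two ideas the abstract combines, discretizing latent vectors and modeling the resulting sequence auto-regressively with conditioning inputs attached, can be illustrated with a minimal sketch. This is not the Argus3D implementation. The nearest-neighbor codebook lookup, the choice to prepend condition tokens to the sequence, and every name below (`quantize`, `sequence_log_prob`, `uniform_model`, the toy codebook) are assumptions made for illustration only.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry.

    Hypothetical stand-in for discrete representation learning: a real
    system would learn both the encoder and the codebook.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def sequence_log_prob(
    cond_tokens: Sequence[int],
    shape_tokens: Sequence[int],
    next_token_probs: Callable[[List[int]], Sequence[float]],
) -> float:
    """Chain rule: log p(z | c) = sum_i log p(z_i | c, z_<i).

    Assumption: conditioning is done by prepending the condition tokens
    to the context that the model sees at every step.
    """
    context = list(cond_tokens)
    total = 0.0
    for tok in shape_tokens:
        probs = next_token_probs(context)
        total += math.log(probs[tok])
        context.append(tok)  # the next step conditions on this token too
    return total


def uniform_model(context: List[int]) -> List[float]:
    """Placeholder model: uniform over a 4-entry codebook, ignores context."""
    return [0.25] * 4


# Toy usage with a hypothetical 2D codebook and three latent vectors.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]
tokens = quantize(latents, codebook)  # [0, 3, 2]
log_p = sequence_log_prob([7], tokens, uniform_model)  # equals 3 * log(0.25)
```

In a trained system, `uniform_model` would be replaced by a learned network such as a transformer decoder, and the condition tokens would come from an encoder of the point cloud, category, image, or text input.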

