FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract

Sparse voxel representations have emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve the high-frequency visual details of input images because of two structural bottlenecks. First, they construct sparse voxel latents from discriminative 2D features optimized for semantic abstraction, which suppresses reconstructive cues and induces a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms for aligning dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that strengthens both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning and propose Diffusion-Aligned Structured Latents (DA-SLAT), which we couple with a decoder-only architecture to improve 3DGS reconstruction fidelity. We then design a sparse-structure-aware diffusion framework that integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D substantially improves appearance fidelity and significantly outperforms existing state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.