FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract

Sparse voxel representations have emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve the high-frequency visual details of input images because of two structural bottlenecks. First, they construct sparse voxel latents from discriminative 2D features optimized for semantic abstraction, a choice that suppresses reconstructive cues and induces a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack an effective mechanism for aligning dense 2D image tokens with sparse 3D voxel latents, which results in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that strengthens both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT), and couple them with a decoder-only architecture to improve 3DGS reconstruction fidelity. We then design a sparse-structure-aware diffusion framework that integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) with Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms existing state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.