FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

Abstract

Sparse voxel representations have emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve the high-frequency visual details of input images because of two structural bottlenecks. First, they construct sparse voxel latents from discriminative 2D features optimized for semantic abstraction, which suppresses reconstructive cues and induces a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack an effective mechanism for aligning dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that improves both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning and propose Diffusion-Aligned Structured Latents (DA-SLAT), which we couple with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework that integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms existing state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.