LATTICE: Democratizing High-Fidelity 3D Generation at Scale

Abstract

We present LATTICE, a new framework for high-fidelity 3D asset generation that bridges the quality and scalability gap between 3D and 2D generative models. While 2D image synthesis benefits from fixed spatial grids and well-established transformer architectures, 3D generation remains fundamentally more challenging because the model must predict both spatial structure and detailed geometric surfaces from scratch. These challenges are exacerbated by the computational complexity of existing 3D representations and the lack of structured, scalable 3D asset encoding schemes. To address these challenges, we propose VoxSet, a semi-structured representation that compresses a 3D asset into a compact set of latent vectors anchored to a coarse voxel grid, enabling efficient, position-aware generation. VoxSet retains the simplicity and compression advantages of prior VecSet methods while introducing explicit structure into the latent space, which allows positional embeddings to guide generation and enables strong token-level test-time scaling. Built on this representation, LATTICE adopts a two-stage pipeline: it first generates a sparse voxelized geometry anchor, then produces detailed geometry with a rectified flow transformer. The method is simple at its core, yet it supports arbitrary-resolution decoding, low-cost training, and flexible inference schemes. It achieves state-of-the-art performance across multiple aspects and offers a significant step toward scalable, high-quality 3D asset creation.
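
To make the two ideas above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a point cloud with coordinates in [-1, 1] and shows (1) how occupied cells of a coarse voxel grid could serve as anchors that each carry a positional embedding, and (2) how a rectified flow sampler integrates a velocity field from noise (t = 0) to a latent (t = 1) with Euler steps. All names (`voxel_anchors`, `positional_embedding`, `toy_velocity`, `sample_latents`) and all parameter values are hypothetical. The velocity function is a closed-form stand-in for the learned transformer, chosen only so that the result can be checked.

```python
import math
import random


def voxel_anchors(points, grid_res=8):
    """Return the sorted occupied cells of a coarse grid.

    Assumes every point has three coordinates in [-1, 1].
    """
    cells = set()
    for p in points:
        cell = tuple(
            # Map [-1, 1] to a cell index in 0..grid_res-1 (clamped at the edges).
            min(grid_res - 1, max(0, int(math.floor((c + 1.0) / 2.0 * grid_res))))
            for c in p
        )
        cells.add(cell)
    return sorted(cells)


def positional_embedding(cell, n_freq=4):
    """Sinusoidal embedding of one integer cell coordinate (length 6 * n_freq)."""
    emb = []
    for c in cell:
        for k in range(n_freq):
            angle = c * 2.0 ** (-k)
            emb.extend([math.sin(angle), math.cos(angle)])
    return emb


def toy_velocity(x, t, target):
    """Velocity of the straight path from the current state x to `target`.

    Stand-in for a learned, conditioned transformer (an assumption of this sketch).
    """
    return [(g - v) / (1.0 - t) for v, g in zip(x, target)]


def sample_latents(anchors, steps=10, seed=0):
    """Euler integration of the flow from Gaussian noise, one latent per anchor."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    latents = []
    for cell in anchors:
        # Toy target: each token is steered toward its own positional embedding.
        target = positional_embedding(cell)
        x = [rng.gauss(0.0, 1.0) for _ in target]
        for i in range(steps):
            t = i * dt  # t never reaches 1, so toy_velocity never divides by zero
            v = toy_velocity(x, t, target)
            x = [xi + dt * vi for xi, vi in zip(x, v)]
        latents.append(x)
    return latents


# Usage: two nearby points share one coarse cell, so three points yield two anchors.
points = [(-0.9, -0.9, -0.9), (-0.95, -0.9, -0.9), (0.9, 0.9, 0.9)]
anchors = voxel_anchors(points, grid_res=4)
latents = sample_latents(anchors)
```

Because the number of latent tokens follows the number of occupied anchors, a representation of this kind can in principle vary its token count at inference time. How LATTICE actually does this is not specified in the abstract.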