FLAT: Geometrically Accurate Scene Generation with Feedforward Latent Triangle Splatting

Abstract

Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in their latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, which limits their use in simulation and standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single forward pass. To this end, we introduce FLAT and show, for the first time, that triangle splats can be decoded directly from video diffusion latents. Predicting flat primitives is considerably harder than decoding 3D Gaussians because flat primitives are highly sensitive to orientation, which often leads to poor gradient flow. FLAT addresses this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy than state-of-the-art feedforward baselines while maintaining competitive visual quality. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io