FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Abstract

Generating explorable 3D scenes from a single image requires strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in their latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, which limits their use in simulation and standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and show, for the first time, that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is considerably more challenging because it is highly sensitive to primitive orientation, which often leads to poor gradient flow. FLAT addresses this with two key ingredients: a ray-centered rotation parameterization for triangle regression and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy than state-of-the-art feedforward baselines while maintaining competitive visual quality. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io.