FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation

Abstract

Generating explorable 3D scenes from a single image requires both strong generative priors and accurate geometric representations suitable for downstream use. Current video diffusion models offer high-quality generation and implicitly encode multi-view geometric structure in their latent space. However, existing feedforward latent scene decoders typically output volumetric 3D Gaussians that lack a well-defined surface, which limits their use in simulation and standard graphics pipelines. This motivates decoding surface-aligned primitives that are not only renderable but also closer to explicit geometric assets. We ask whether compressed video diffusion latents can be mapped directly to explicit surface primitives in a single pass. To this end, we introduce FLAT and show, for the first time, that triangle splats can be decoded directly from video diffusion latents. Compared with decoding 3D Gaussians, predicting flat primitives is notoriously more challenging because of its high sensitivity to primitive orientation, which often leads to poor gradient flow. FLAT addresses this problem with two key ingredients: a ray-centered rotation parameterization for triangle regression, and a novel product window function that improves gradient flow during differentiable triangle rendering. On standard benchmarks, FLAT achieves significantly better geometric accuracy than state-of-the-art feedforward baselines while maintaining competitive visual quality. We further show that a lightweight test-time refinement step converts the predicted triangle soup into a fully opaque, game-engine-ready representation that supports real-time rendering. By evaluating 3DGS, 2DGS, and triangle splatting variants under an identical training setup, we provide the first systematic analysis of representation tradeoffs in feedforward scene generation. The project page is available at https://flat-splat.github.io.