TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Abstract

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.
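
To illustrate the geometric step described above (normals constructed from a predicted point map, then converted into a local frame), the following is a minimal sketch and not the TriSplat implementation. It assumes the point map is an H x W grid of 3D points, estimates each normal as the cross product of finite differences along the two image axes, and builds an orthonormal tangent frame with a helper-axis construction. The function names (`normals_from_point_map`, `frame_from_normal`), the border handling, and the sign convention of the normal are all assumptions made for illustration; the image-conditioned refinement head is not modeled.

```python
import math


def sub(a, b):
    """Component-wise difference of two 3-vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v, eps=1e-12):
    """Return v scaled to unit length, or the zero vector if v is degenerate."""
    norm = math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    if norm < eps:
        return (0.0, 0.0, 0.0)
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def normals_from_point_map(points):
    """Estimate per-pixel normals from an H x W grid of 3D points.

    Uses central differences in the interior and one-sided differences at
    the border (indices are clamped). The orientation of the result depends
    on the axis convention of the point map (an assumption here).
    """
    height, width = len(points), len(points[0])
    normals = []
    for i in range(height):
        row = []
        for j in range(width):
            i0, i1 = max(i - 1, 0), min(i + 1, height - 1)
            j0, j1 = max(j - 1, 0), min(j + 1, width - 1)
            du = sub(points[i][j1], points[i][j0])  # change along image columns
            dv = sub(points[i1][j], points[i0][j])  # change along image rows
            row.append(normalize(cross(du, dv)))
        normals.append(row)
    return normals


def frame_from_normal(n):
    """Build a right-handed orthonormal frame (t, b, n) with t x b = n.

    The helper axis is chosen to be not nearly parallel to n, which keeps
    the cross product well conditioned.
    """
    n = normalize(n)
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    t = normalize(cross(helper, n))
    b = cross(n, t)
    return t, b, n


if __name__ == "__main__":
    # A flat 3 x 3 point map lying in the plane z = 2.
    plane = [[(float(j), float(i), 2.0) for j in range(3)] for i in range(3)]
    normal = normals_from_point_map(plane)[1][1]
    print(normal, frame_from_normal(normal))
```

Deriving the frame from a normal, rather than regressing a free rotation, leaves only the in-plane degrees of freedom to be predicted, which is one plausible reading of why the abstract calls the resulting frames stable.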