TriSplat: Simulation-Oriented Feed-Forward 3D Scene Reconstruction

Abstract

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.
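The orientation pipeline described above (geometric normals from a predicted point map, then a stable local frame per primitive) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the point map is an H x W grid of 3D points, estimates normals by crossing finite differences along the two image axes, skips the image-conditioned refinement step entirely, and builds the frame with a simple helper-axis construction. All function names are hypothetical, and the sign of the normal depends on a camera-axis convention the abstract does not specify.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v, eps=1e-12):
    length = math.sqrt(dot(v, v))
    if length < eps:
        # Degenerate input (e.g. a flat 1x1 map): fall back to +z (assumption).
        return (0.0, 0.0, 1.0)
    return (v[0] / length, v[1] / length, v[2] / length)

def normals_from_point_map(points):
    """Estimate one unit normal per pixel of an H x W grid of (x, y, z) points.

    Uses central differences in the interior and one-sided differences at
    the borders, then crosses the horizontal and vertical tangents.
    """
    h, w = len(points), len(points[0])
    normals = []
    for i in range(h):
        row = []
        for j in range(w):
            j0, j1 = max(j - 1, 0), min(j + 1, w - 1)
            i0, i1 = max(i - 1, 0), min(i + 1, h - 1)
            du = sub(points[i][j1], points[i][j0])  # tangent along image x
            dv = sub(points[i1][j], points[i0][j])  # tangent along image y
            row.append(normalize(cross(du, dv)))
        normals.append(row)
    return normals

def frame_from_normal(n):
    """Build a right-handed orthonormal frame (t, b, n) around a unit normal.

    The helper axis is switched when n is nearly parallel to x, so the
    cross product never collapses.
    """
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    t = normalize(cross(helper, n))
    b = cross(n, t)
    return t, b, n
```

In a sketch like this, a triangle's in-plane vertex offsets would be expressed in the `(t, b)` basis, so the predicted attributes only need to describe shape and scale within the surface plane rather than a free 3D rotation.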