TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Abstract

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.
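The geometric part of the orientation pipeline described above (normals from a point map, then a local frame for placing a triangle) can be illustrated with a minimal sketch. It is not the TriSplat implementation. The forward-difference stencil, the rule that flips normals toward the camera, the helper-axis frame construction, and the equilateral `triangle_vertices` parameterization are all assumptions made for illustration, and every function name is hypothetical. The learned normal-refinement head and the training schedules are not modeled.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v, eps=1e-12):
    n = math.sqrt(dot(v, v))
    if n < eps:
        raise ValueError("degenerate vector")
    return (v[0] / n, v[1] / n, v[2] / n)

def pointmap_normal(pm, r, c):
    """Unit normal at pixel (r, c) of an H x W point map (camera coordinates).

    Assumption: forward differences along image columns and rows, with the
    normal flipped so that it faces the camera at the origin.
    """
    p = pm[r][c]
    du = sub(pm[r][c + 1], p)  # step along image columns
    dv = sub(pm[r + 1][c], p)  # step along image rows
    n = normalize(cross(du, dv))
    if dot(n, p) > 0.0:        # pointing away from the camera: flip
        n = (-n[0], -n[1], -n[2])
    return n

def local_frame(n):
    """Right-handed orthonormal frame (t, b, n) built from a unit normal.

    Assumption: a fixed helper axis that is not nearly parallel to n; the
    frame construction used in the paper may differ.
    """
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    t = normalize(cross(helper, n))
    b = cross(n, t)
    return t, b, n

def triangle_vertices(center, t, b, radius):
    """Hypothetical parameterization: equilateral triangle in the tangent plane."""
    verts = []
    for k in range(3):
        ang = 2.0 * math.pi * k / 3.0
        ct, st = math.cos(ang), math.sin(ang)
        verts.append(tuple(center[i] + radius * (ct * t[i] + st * b[i])
                           for i in range(3)))
    return verts

if __name__ == "__main__":
    # A 2 x 2 point map sampled from the plane z = 2 in front of the camera.
    pm = [[(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],
          [(0.0, 1.0, 2.0), (1.0, 1.0, 2.0)]]
    n = pointmap_normal(pm, 0, 0)
    t, b, _ = local_frame(n)
    print(n, triangle_vertices(pm[0][0], t, b, 0.5))
```

Deriving the frame from the surface normal, rather than regressing a free rotation, keeps every triangle in the tangent plane of the predicted geometry by construction. That is the property that lets the primitives double as mesh faces.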