TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

Abstract

Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.
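The geometric step described above (normals derived from a predicted point map, then converted into local frames) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the finite-difference normal estimate, the camera-facing sign convention, and the helper-axis frame construction are all assumptions chosen for illustration, and the image-conditioned normal refinement head is omitted entirely.

```python
import numpy as np

def normals_from_point_map(points):
    """Estimate unit normals from an (H, W, 3) point map (hypothetical helper).

    Assumes points are expressed in the camera frame with the camera at the
    origin, and that H and W are both at least 2.
    """
    # Finite-difference tangents along image columns (u) and rows (v).
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    n = np.cross(du, dv)
    # Normalize, guarding against degenerate (zero-area) neighborhoods.
    n = n / np.clip(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8, None)
    # Assumed convention: flip normals so they face the camera (n . p < 0).
    facing_away = np.sum(n * points, axis=-1, keepdims=True) > 0
    return np.where(facing_away, -n, n)

def frames_from_normals(normals):
    """Build (H, W, 3, 3) rotations whose columns are (tangent, bitangent, normal)."""
    # Choose a helper axis that is never parallel to the normal.
    helper = np.where(np.abs(normals[..., 2:3]) < 0.9,
                      [0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
    t = np.cross(helper, normals)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    b = np.cross(normals, t)  # completes a right-handed orthonormal frame
    return np.stack([t, b, normals], axis=-1)
```

For a point map of shape (H, W, 3), calling frames_from_normals(normals_from_point_map(points)) yields one orthonormal frame per pixel, within which a triangle's in-plane vertices could be parameterized. Deriving the frame from geometry rather than regressing it freely is what keeps each frame a valid rotation by construction.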