GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

Abstract

We introduce a new approach to high-fidelity 3D scene reconstruction from multi-view RGB images that tightly couples reconstruction with a strong generative 3D prior. We cast scene reconstruction as conditional 3D generation over a set of spatially localized, overlapping chunks that together tile the scene, which scales generation to large scene extents. Crucially, we inherit the fidelity and completeness of state-of-the-art generative shape models (we use Trellis.2 as an example) and generalize them to the scene level. To this end, we propose a projection-based conditioning mechanism that lifts posed multi-view image features into a coherent 3D representation aligned with the generative model. The representation is independent of view ordering and spatially anchored to the scene, yielding high-fidelity, multi-view-consistent generated geometry. This carries the strong object-level prior of Trellis.2 over to multi-view, scene-scale generation and produces faithful, editable PBR mesh reconstructions of indoor environments. As a result, our reconstructions outperform state-of-the-art reconstruction methods by 16%.
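To make the idea of projection-based, order-independent conditioning concrete, the following is a minimal NumPy sketch and not the implementation used in this work. It assumes a pinhole camera model, nearest-pixel feature sampling, and mean pooling over the views that see each 3D point. The function name `lift_features_to_voxels` and all of its parameters are hypothetical, and nothing here reflects the Trellis.2 interface.

```python
import numpy as np


def lift_features_to_voxels(points, feats, K, world_to_cam):
    """Pool per-view image features at the projections of 3D points.

    Hypothetical helper: pinhole projection, nearest-pixel sampling and
    mean pooling are illustrative assumptions, not the paper's method.

    points:       (P, 3) voxel centers in world coordinates
    feats:        (V, H, W, C) per-view feature maps
    K:            (V, 3, 3) intrinsics in feature-map pixel units
    world_to_cam: (V, 4, 4) world-to-camera rigid transforms
    Returns (P, C) pooled features and (P,) number of views seeing each point.
    """
    V, H, W, C = feats.shape
    P = points.shape[0]
    homog = np.concatenate([points, np.ones((P, 1))], axis=1)  # (P, 4)
    acc = np.zeros((P, C))
    count = np.zeros(P)

    for v in range(V):
        # Transform the points into the camera frame of view v.
        cam = (world_to_cam[v] @ homog.T).T[:, :3]
        z = cam[:, 2]
        in_front = z > 1e-6
        safe_z = np.where(in_front, z, 1.0)  # avoid dividing by <= 0 depth

        # Perspective projection to pixel coordinates (col, row).
        pix = (K[v] @ cam.T).T
        col = np.round(pix[:, 0] / safe_z).astype(int)
        row = np.round(pix[:, 1] / safe_z).astype(int)

        # Keep only points in front of the camera and inside the image.
        valid = in_front & (col >= 0) & (col < W) & (row >= 0) & (row < H)
        acc[valid] += feats[v, row[valid], col[valid]]
        count[valid] += 1

    # Mean over views is symmetric, so the result ignores view ordering.
    pooled = acc / np.maximum(count, 1)[:, None]
    return pooled, count
```

Because the features are indexed by world-space position and combined with a symmetric reduction, permuting the input views leaves the pooled volume unchanged up to floating-point rounding. Points seen by no view receive a zero feature in this sketch; a real system would likely handle them differently.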