LRM: Large Reconstruction Model for Single Image to 3D

Abstract

We propose the first Large Reconstruction Model (LRM), which predicts the 3D model of an object from a single input image in just 5 seconds. In contrast to many previous methods, which are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model end to end on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data makes our model highly generalizable, producing high-quality 3D reconstructions from a variety of test inputs, including real-world in-the-wild captures and images from generative models. Video demos and interactive 3D meshes can be found on this website: https://yiconghong.me/LRM/.
