One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, which limits the scale and diversity of training data. We argue that this is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. At training time, we use a monocular depth estimator as a geometric scaffold: we lift a source image into 3D, apply a sampled camera transformation, and project the result to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free and requires no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
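The lift, transform, and project step and the masked loss can be illustrated with a minimal NumPy sketch. This is not the OVIE implementation: the function names `warp_to_pseudo_target` and `masked_l1`, the pinhole intrinsics `K`, the convention that `R` and `t` map source-camera coordinates to target-camera coordinates, the nearest-pixel splatting, and the plain L1 loss are all assumptions chosen for illustration. The abstract does not specify how the warp or the losses are computed.

```python
import numpy as np


def warp_to_pseudo_target(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) into a new camera using `depth` (H, W).

    Illustrative only. Returns the warped image and a boolean mask that is
    True where a target pixel received a source pixel. Pixels left False are
    disocclusions or fall outside the source view.
    """
    H, W = depth.shape
    C = image.shape[-1]

    # Homogeneous pixel grid, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)

    # Lift: back-project every pixel to a 3D point in the source camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()

    # Transform: express the points in the target camera frame.
    pts_t = R @ pts + t.reshape(3, 1)

    # Project: perspective divide, then round to the nearest target pixel.
    z = pts_t[2]
    in_front = z > 1e-6
    proj = K @ pts_t
    safe_z = np.where(in_front, z, 1.0)
    ut = np.rint(proj[0] / safe_z).astype(int)
    vt = np.rint(proj[1] / safe_z).astype(int)
    ok = in_front & (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)

    # Z-buffer: for each target pixel keep only the nearest source point.
    idx = np.flatnonzero(ok)
    lin = vt[idx] * W + ut[idx]
    order = np.lexsort((z[idx], lin))  # primary key: target pixel, then depth
    idx, lin = idx[order], lin[order]
    _, first = np.unique(lin, return_index=True)
    idx, lin = idx[first], lin[first]

    target = np.zeros((H * W, C), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    target[lin] = image.reshape(-1, C)[idx]
    mask[lin] = True
    return target.reshape(H, W, C), mask.reshape(H, W)


def masked_l1(pred, target, mask):
    """Mean absolute error over valid pixels only (a stand-in for the masked losses)."""
    m = mask[..., None].astype(np.float64)
    denom = max(m.sum() * pred.shape[-1], 1.0)
    return float((np.abs(pred - target) * m).sum() / denom)
```

With an identity transform the warp reproduces the source image and the mask is fully valid. With a sideways translation the image shifts and a band of unfilled pixels appears along one edge, which is the region the mask excludes from the loss.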

