

One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

March 24, 2026
Authors: Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard
cs.AI

Abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
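The training-time scaffold described above (lift the source image to 3D with estimated depth, apply a sampled camera transformation, reproject to get a pseudo-target view, and restrict losses to valid pixels) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the nearest-pixel splatting, and the simple L1 loss are assumptions, and a real pipeline would need z-buffering to resolve pixels that collide in the target view.

```python
import numpy as np

def lift_warp_project(image, depth, K, R, t):
    """Forward-warp a source image into a pseudo-target view.

    Hypothetical sketch of the training-time geometric scaffold:
    unproject each pixel with its estimated depth, apply the sampled
    camera motion (R, t), reproject through intrinsics K, and splat
    colours via nearest-pixel scatter. Pixels that receive no colour
    (disocclusions) stay marked invalid. Z-buffering is omitted for
    brevity, so colliding pixels are resolved by write order.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)
    # Lift to 3D: X = depth * K^-1 * pixel (homogeneous coords)
    pts = depth.reshape(-1, 1) * (np.linalg.inv(K) @ pix.T).T
    # Apply the sampled camera transform, then project back
    pts = pts @ R.T + t
    proj = (K @ pts.T).T
    uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
    target = np.zeros_like(image)
    valid = np.zeros((h, w), dtype=bool)
    inb = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
           (uv[:, 1] >= 0) & (uv[:, 1] < h) & (pts[:, 2] > 0))
    src = image.reshape(-1, image.shape[-1])
    target[uv[inb, 1], uv[inb, 0]] = src[inb]
    valid[uv[inb, 1], uv[inb, 0]] = True
    return target, valid

def masked_l1(pred, pseudo_target, valid):
    """Restrict the reconstruction loss to valid (non-disoccluded)
    pixels, as in the masked training formulation."""
    m = np.broadcast_to(valid[..., None], pred.shape)
    return np.abs(pred - pseudo_target)[m].mean()
```

With an identity transform the warp reproduces the source image and every pixel is valid; under a real sampled motion, the invalid region marks disocclusions that the masked losses skip, so the network alone must hallucinate that content.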