One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue that this is unnecessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
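
The lift, transform, and project step and the masked loss described above can be illustrated with a small NumPy sketch. This is not the OVIE implementation: the function names `warp_to_pseudo_target` and `masked_l1` are hypothetical, the camera is assumed to be a simple pinhole model with intrinsics `K` whose last row is `[0, 0, 1]`, pixels are splatted to the nearest target pixel with a z-buffer, and a plain L1 term stands in for the geometric, perceptual, and textural losses named in the abstract.

```python
import numpy as np


def warp_to_pseudo_target(image, depth, K, R, t):
    """Forward-warp image (H, W, C) into a new camera using depth (H, W).

    K is an assumed 3x3 pinhole intrinsics matrix; (R, t) maps source-camera
    coordinates to target-camera coordinates. Returns the warped image and a
    boolean mask of target pixels that received a source pixel.
    """
    H, W = depth.shape
    C = image.shape[-1]

    # Homogeneous pixel grid, shape (3, H*W), in row-major pixel order.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)

    # Lift: back-project every pixel to a 3D point in the source camera frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()

    # Transform into the target camera frame, then project with K.
    pts_tgt = R @ pts_src + np.asarray(t, dtype=np.float64).reshape(3, 1)
    z = pts_tgt[2]
    in_front = z > 1e-6
    proj = K @ pts_tgt
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by non-positive depth
    x = np.rint(proj[0] / safe_z).astype(np.int64)
    y = np.rint(proj[1] / safe_z).astype(np.int64)

    # Keep points that land inside the target image.
    idx = np.flatnonzero(in_front & (x >= 0) & (x < W) & (y >= 0) & (y < H))
    flat = y[idx] * W + x[idx]

    # Z-buffer: when several points hit one pixel, keep the nearest one.
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, flat, z[idx])
    nearest = z[idx] == zbuf[flat]

    colors = image.reshape(-1, C)
    warped = np.zeros((H * W, C), dtype=image.dtype)
    warped[flat[nearest]] = colors[idx[nearest]]

    # Pixels never written are holes (disocclusions or out-of-view regions).
    mask = np.isfinite(zbuf).reshape(H, W)
    return warped.reshape(H, W, C), mask


def masked_l1(pred, target, mask):
    """Mean absolute error computed over valid (mask == True) pixels only."""
    m = mask[..., None].astype(np.float64)
    denom = max(m.sum() * pred.shape[-1], 1.0)
    return float((np.abs(pred - target) * m).sum() / denom)
```

In this sketch, pixels where `mask` is false have no pseudo-target value, so the loss ignores whatever the model predicts there. That is the sense in which supervision is restricted to valid regions.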
