EscherNet: A Generative Model for Scalable View Synthesis
February 6, 2024
Authors: Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, Andrew J. Davison
cs.AI
Abstract
We introduce EscherNet, a multi-view conditioned diffusion model for view
synthesis. EscherNet learns implicit and generative 3D representations coupled
with a specialised camera positional encoding, allowing precise and continuous
relative control of the camera transformation between an arbitrary number of
reference and target views. EscherNet offers exceptional generality,
flexibility, and scalability in view synthesis -- it can generate more than 100
consistent target views simultaneously on a single consumer-grade GPU, despite
being trained with a fixed number of 3 reference views to 3 target views. As a
result, EscherNet not only addresses zero-shot novel view synthesis, but also
naturally unifies single- and multi-image 3D reconstruction, combining these
diverse tasks into a single, cohesive framework. Our extensive experiments
demonstrate that EscherNet achieves state-of-the-art performance on multiple
benchmarks, even when compared to methods specifically tailored for each
individual problem. This remarkable versatility opens up new directions for
designing scalable neural architectures for 3D vision. Project page:
https://kxhit.github.io/EscherNet.