EscherNet: A Generative Model for Scalable View Synthesis

Abstract

We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis: it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed setup of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance on multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet.
