EscherNet: A Generative Model for Scalable View Synthesis

Abstract

We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis: it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet.
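To make the phrase "relative control of the camera transformation" concrete, here is a minimal sketch of what a relative camera transformation between a reference view and a target view looks like. It assumes cameras are given as 4x4 camera-to-world matrices; the helper names `relative_pose` and `translation` are hypothetical and chosen for illustration only. This is not EscherNet's camera positional encoding, which the abstract does not specify. It only shows the quantity such an encoding would need to capture, and that this quantity does not depend on the choice of global coordinate frame.

```python
import numpy as np


def relative_pose(ref_c2w: np.ndarray, tgt_c2w: np.ndarray) -> np.ndarray:
    """Return the 4x4 transform mapping target-camera coordinates
    into the reference camera's coordinate frame."""
    return np.linalg.inv(ref_c2w) @ tgt_c2w


def translation(x: float, y: float, z: float) -> np.ndarray:
    """Build a 4x4 rigid transform that is a pure translation (illustrative helper)."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


# Two hypothetical camera poses (camera-to-world).
ref = translation(1.0, 0.0, 0.0)
tgt = translation(1.0, 2.0, 0.0)
rel = relative_pose(ref, tgt)

# Moving both cameras by the same global transform leaves the relative pose unchanged.
G = translation(5.0, -3.0, 2.0)
rel_shifted = relative_pose(G @ ref, G @ tgt)
```

Because only the relative transform enters, the same computation applies to any pair drawn from an arbitrary number of reference and target views.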

