MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

Abstract

This paper introduces MVDiffusion, a simple yet effective multi-view image generation method for scenarios where pixel-to-pixel correspondences are available, such as perspective crops from a panorama or multi-view images with known geometry (depth maps and poses). Unlike prior models that rely on iterative image warping and inpainting, MVDiffusion generates all images concurrently with global awareness, producing high-resolution, content-rich results and addressing the error accumulation common in earlier models. MVDiffusion incorporates a correspondence-aware attention mechanism that enables effective cross-view interaction. This mechanism underpins three modules: 1) a generation module that produces low-resolution images while maintaining global correspondence, 2) an interpolation module that densifies spatial coverage between images, and 3) a super-resolution module that upscales the images into high-resolution outputs. For panoramic imagery, MVDiffusion can generate high-resolution photorealistic images of up to 1024×1024 pixels. For geometry-conditioned multi-view image generation, MVDiffusion is the first method capable of generating a textured map of a scene mesh. The project page is at https://mvdiffusion.github.io.
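
The abstract takes pixel-to-pixel correspondences as given. For perspective crops of a panorama, all views share one camera center, so a correspondence can be found by rotating a pixel's viewing ray into the other view and projecting it again. The following is a minimal sketch of that geometry only, not the authors' implementation: it assumes square pinhole crops that differ by a pure yaw rotation, and the function name `correspond_pixel` and the values used (512-pixel crops, 90° field of view, 45° yaw) are illustrative assumptions rather than details stated in the abstract.

```python
import math


def correspond_pixel(u, v, yaw_deg, size=512, fov_deg=90.0):
    """Map pixel (u, v) in view A to view B, where B is A yawed right by yaw_deg.

    Hypothetical helper for illustration. Returns (u_b, v_b), or None when
    the viewing ray does not land inside view B.
    """
    # Pinhole intrinsics of a square crop: focal length from the field of view.
    f = (size / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    c = size / 2.0

    # Back-project the pixel to a ray in A's frame (x right, y down, z forward).
    x, y, z = (u - c) / f, (v - c) / f, 1.0

    # Rotate the ray into B's frame (inverse of a yaw about the y axis).
    t = math.radians(yaw_deg)
    xb = math.cos(t) * x - math.sin(t) * z
    yb = y
    zb = math.sin(t) * x + math.cos(t) * z

    # Rays pointing sideways or behind view B have no projection.
    if zb <= 1e-9:
        return None

    # Project onto B's image plane and reject points outside the crop.
    u_b, v_b = f * xb / zb + c, f * yb / zb + c
    if not (0.0 <= u_b <= size and 0.0 <= v_b <= size):
        return None
    return u_b, v_b
```

Under these assumptions, `correspond_pixel(384.0, 256.0, 45.0)` lands at roughly (170.7, 256.0) in the neighboring view, while pixels on the far left of view A return `None` because they fall outside the overlap. Only pixel pairs inside such an overlap can take part in cross-view interaction.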
