Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Abstract

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in-domain views, our method leverages off-the-shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between generated images and geometry, we propose cross-modal attention distillation, where attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating both geometrically robust image synthesis and well-defined geometry prediction. We further introduce proximity-based mesh conditioning to integrate depth and normal cues, interpolating between point clouds and preventing erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis on both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
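
To make the attention injection idea concrete, the following is a minimal NumPy sketch of one plausible reading of it: the image branch computes its attention map as usual, and the geometry branch aggregates its own value features with that same map instead of computing one from its own queries and keys. The function names, the single-head formulation, and the tensor shapes are assumptions made for illustration and are not taken from the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_map(q, k):
    """Scaled dot-product attention weights of shape (n_query, n_key)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)


def image_branch(q_img, k_img, v_img):
    """Hypothetical image branch: returns its output and its attention map."""
    attn = attention_map(q_img, k_img)
    return attn @ v_img, attn


def geometry_branch(v_geo, injected_attn):
    """Hypothetical geometry branch: reuses the injected image attention map
    rather than computing its own, so both modalities attend to the same
    spatial correspondences."""
    return injected_attn @ v_geo


# Toy inputs: 6 tokens with 8-dimensional features (arbitrary sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 6, 8
q_img, k_img, v_img = (rng.standard_normal((n_tokens, dim)) for _ in range(3))
v_geo = rng.standard_normal((n_tokens, dim))

img_out, attn = image_branch(q_img, k_img, v_img)
geo_out = geometry_branch(v_geo, attn)
```

Because both outputs are weighted sums under the same attention map, each geometry token draws on the same source locations as its image counterpart, which is the intuition behind the alignment claim. A real diffusion model would apply this per layer and per head inside the denoising network.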