Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Abstract

We introduce a diffusion-based framework that generates aligned novel-view images and geometry through a warping-and-inpainting approach. Unlike prior methods, which either require densely posed images or rely on pose-embedded generative models limited to in-domain views, our method uses off-the-shelf geometry predictors to estimate the partial geometry visible from the reference images, and formulates novel-view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between the generated images and geometry, we propose cross-modal attention instillation, in which attention maps from the image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach is synergistic: it yields geometrically robust image synthesis as well as well-defined geometry prediction. We further introduce proximity-based mesh conditioning, which integrates depth and normal cues, interpolates between points in the point cloud, and prevents erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis for both image and geometry across a range of unseen scenes, delivers competitive reconstruction quality under interpolation settings, and produces geometrically aligned colored point clouds for comprehensive 3D completion. The project page is available at https://cvlab-kaist.github.io/MoAI.
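
The idea of injecting attention maps from the image branch into the geometry branch can be illustrated with a small NumPy sketch. This is an illustrative toy, not the authors' implementation: the single-head attention layer, the function names (`image_branch`, `geometry_branch`), the token and feature sizes, and the choice to fully replace the geometry branch's own attention weights are all assumptions made here for clarity. The abstract does not specify which layers are involved or how the maps are combined.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(q, k):
    # Scaled dot-product attention weights, shape (tokens, tokens).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

def image_branch(x_img, w_q, w_k, w_v):
    # Image branch: computes its own attention map and returns it
    # alongside the attended features so it can be reused elsewhere.
    attn = attention_map(x_img @ w_q, x_img @ w_k)
    return attn @ (x_img @ w_v), attn

def geometry_branch(x_geo, w_v, injected_attn):
    # Geometry branch (assumption): computes no queries or keys of its own.
    # It aggregates its value features with the image branch's attention map,
    # so both modalities share the same spatial correspondences.
    return injected_attn @ (x_geo @ w_v)

# Hypothetical sizes: 6 spatial tokens, 8 feature channels.
rng = np.random.default_rng(0)
tokens, dim = 6, 8
x_img = rng.normal(size=(tokens, dim))  # image latent tokens
x_geo = rng.normal(size=(tokens, dim))  # geometry latent tokens
w_q, w_k, w_v_img, w_v_geo = (rng.normal(size=(dim, dim)) for _ in range(4))

img_out, attn = image_branch(x_img, w_q, w_k, w_v_img)
geo_out = geometry_branch(x_geo, w_v_geo, attn)
```

Because both outputs are aggregated with the same weights, each output token in the geometry branch attends to exactly the same locations as the corresponding image token, which is one plausible reading of how shared attention could encourage pixel-level alignment between the two modalities.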
