What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?

Abstract

The recent success of text-to-image diffusion models such as Stable Diffusion has stimulated research on adapting them to 360-degree panorama generation. Prior work has shown that conventional low-rank adaptation (LoRA) techniques applied to pre-trained diffusion models can generate panoramic images. However, the substantial domain gap between perspective and panoramic images raises questions about the mechanisms underlying this empirical success. We hypothesize, and then examine, that the trainable components behave distinctly when fine-tuned on panoramic data, and that this adaptation conceals an intrinsic mechanism for leveraging the prior knowledge in the pre-trained diffusion model. Our analysis reveals two key findings: 1) the query and key matrices in the attention modules are responsible for common information that can be shared between the panoramic and perspective domains, and are therefore less relevant to panorama generation; and 2) the value and output weight matrices specialize in adapting pre-trained knowledge to the panoramic domain, playing a more critical role during fine-tuning for panorama generation. We verify these insights empirically by introducing a simple framework called UniPano, intended as an elegant baseline for future research. UniPano not only outperforms existing methods but also significantly reduces memory usage and training time compared with prior dual-branch approaches, making it scalable to end-to-end panorama generation at higher resolution. The code will be released.
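One way to read the two findings is that low-rank adapters matter most on the value and output projections of each attention block, while the query and key projections can stay frozen. The toy sketch below illustrates that reading in plain Python. It is not the UniPano implementation: the projection names (`q`, `k`, `v`, `out`), the helper functions, and the choice of rank are all assumptions made for illustration.

```python
import random

# Hypothetical names for the four attention projections (illustrative only).
PROJECTIONS = ("q", "k", "v", "out")
# Assumption drawn from the abstract: only value and output weights are adapted.
ADAPTED = ("v", "out")


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_lora(d, rank, rng):
    """Create low-rank factors B (d x rank, zeros) and A (rank x d, small noise)."""
    B = [[0.0] * rank for _ in range(d)]
    A = [[rng.gauss(0.0, 0.01) for _ in range(d)] for _ in range(rank)]
    return B, A


def attach_adapters(d, rank, seed=0):
    """Attach an adapter to each projection in ADAPTED; the rest stay frozen (None)."""
    rng = random.Random(seed)
    return {
        name: (make_lora(d, rank, rng) if name in ADAPTED else None)
        for name in PROJECTIONS
    }


def effective_weight(W, lora):
    """Return W + B @ A when an adapter is attached, else the frozen W."""
    if lora is None:
        return W
    B, A = lora
    delta = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def count_trainable(adapters):
    """Count the scalar parameters held in the attached adapters."""
    total = 0
    for lora in adapters.values():
        if lora is not None:
            B, A = lora
            total += len(B) * len(B[0]) + len(A) * len(A[0])
    return total


adapters = attach_adapters(d=8, rank=2)
print(sorted(name for name, lora in adapters.items() if lora is not None))
print(count_trainable(adapters))
```

Because `B` starts at zero, each adapted projection initially equals its pre-trained weight, so fine-tuning begins from the original model. Restricting adapters to two of the four projections also halves the adapter parameter count compared with adapting all four at the same rank.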

