What Makes for Text to 360-degree Panorama Generation with Stable Diffusion?
May 28, 2025
Authors: Jinhong Ni, Chang-Bin Zhang, Qiang Zhang, Jing Zhang
cs.AI
Abstract
The recent success of text-to-image diffusion models, e.g., Stable Diffusion,
has stimulated research on adapting them to 360-degree panorama generation. Prior
work has demonstrated the feasibility of using conventional low-rank adaptation
techniques on pre-trained diffusion models to generate panoramic images.
However, the substantial domain gap between perspective and panoramic images
raises questions about the underlying mechanisms enabling this empirical
success. We hypothesize, and examine, that the trainable components exhibit
distinct behaviors when fine-tuned on panoramic data, and that this adaptation
conceals an intrinsic mechanism for leveraging the prior knowledge within the
pre-trained diffusion models. Our analysis reveals the following: 1) the query
and key matrices in the attention modules are responsible for common
information that can be shared between the panoramic and perspective domains,
and are thus less relevant to panorama generation; and 2) the value and output
weight matrices specialize in adapting pre-trained knowledge to the panoramic
domain, playing a more critical role during fine-tuning for panorama
generation. We empirically verify these insights by introducing a simple
framework called UniPano, with the objective of establishing an elegant
baseline for future research. UniPano not only outperforms existing methods but
also significantly reduces memory usage and training time compared to prior
dual-branch approaches, making it scalable for end-to-end panorama generation
at higher resolution. The code will be released.
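
The finding above, that the value and output projections matter most for panorama fine-tuning, suggests restricting low-rank adapters to those two matrices. The following is a minimal sketch of that idea, not the UniPano implementation. It assumes the to_q / to_k / to_v / to_out.0 naming commonly used for attention projections in Stable Diffusion codebases; the module paths, the 320-dimensional width, and the rank of 4 are hypothetical values chosen only for illustration.

```python
# Illustrative sketch only; this is not the authors' code.
# Assumption: attention projections are named to_q, to_k, to_v, to_out.0.

def select_lora_targets(module_names, suffixes=("to_v", "to_out.0")):
    """Keep only the projections that would receive low-rank adapters."""
    return [name for name in module_names if name.endswith(suffixes)]


def lora_param_count(d_in, d_out, rank):
    """A rank-r update B @ A has A of shape (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical module paths for a single self-attention layer.
names = [
    "down_blocks.0.attentions.0.attn1.to_q",
    "down_blocks.0.attentions.0.attn1.to_k",
    "down_blocks.0.attentions.0.attn1.to_v",
    "down_blocks.0.attentions.0.attn1.to_out.0",
]

targets = select_lora_targets(names)  # query and key stay frozen

dim, rank = 320, 4  # hypothetical projection width and adapter rank
trainable = sum(lora_param_count(dim, dim, rank) for _ in targets)
full = len(targets) * dim * dim  # cost of fully fine-tuning the same matrices

print(targets)
print(f"adapter parameters: {trainable} vs. full fine-tuning: {full}")
```

Under these assumptions, the query and key projections are left untouched, and the adapters on the two remaining matrices hold a small fraction of the parameters that full fine-tuning of those matrices would train.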