Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
September 26, 2024
Authors: Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, Ying-Cong Chen
cs.AI
Abstract
Leveraging the visual priors of pre-trained text-to-image diffusion models
offers a promising solution to enhance zero-shot generalization in dense
prediction tasks. However, existing methods often uncritically use the original
diffusion formulation, which may not be optimal due to the fundamental
differences between dense prediction and image generation. In this paper, we
provide a systematic analysis of the diffusion formulation for dense
prediction, focusing on both quality and efficiency. We find that the
original parameterization type for image generation, which learns to predict
noise, is harmful for dense prediction; the multi-step noising/denoising
diffusion process is also unnecessary and challenging to optimize. Based on
these insights, we introduce Lotus, a diffusion-based visual foundation model
with a simple yet effective adaptation protocol for dense prediction.
Specifically, Lotus is trained to directly predict annotations instead of
noise, thereby avoiding harmful variance. We also reformulate the diffusion
process into a single-step procedure, simplifying optimization and
significantly boosting inference speed. Additionally, we introduce a novel
tuning strategy called detail preserver, which achieves more accurate and
fine-grained predictions. Without scaling up the training data or model
capacity, Lotus achieves SoTA performance in zero-shot depth and normal
estimation across various datasets. It also significantly enhances efficiency,
being hundreds of times faster than most existing diffusion-based methods.