Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis

Abstract

The success of deep learning in computer vision over the past decade has hinged on large labeled datasets and strong pretrained models. In data-scarce settings, the quality of these pretrained models becomes crucial for effective transfer learning. Image classification and self-supervised learning have traditionally been the primary methods for pretraining CNNs and transformer-based architectures. Recently, the rise of text-to-image generative models, particularly those using denoising diffusion in a latent space, has introduced a new class of foundation models trained on massive captioned image datasets. The ability of these models to generate realistic images of unseen content suggests they possess a deep understanding of the visual world. In this work, we present Marigold, a family of conditional generative models and a fine-tuning protocol that extracts knowledge from pretrained latent diffusion models such as Stable Diffusion and adapts it to dense image analysis tasks, including monocular depth estimation, surface normal prediction, and intrinsic decomposition. Marigold requires minimal modification of the pretrained latent diffusion model's architecture, trains on small synthetic datasets on a single GPU in a few days, and demonstrates state-of-the-art zero-shot generalization. Project page: https://marigoldcomputervision.github.io
