Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
September 17, 2024
Authors: Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe
cs.AI
Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
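
The abstract refers to single-step inference with an image-conditional latent diffusion model of the Marigold family, but does not spell out the setup. The sketch below is a rough illustration of such a pipeline, not the authors' released code: it assumes a Stable-Diffusion-style VAE and a UNet that takes the RGB latent concatenated with a depth latent, the usual 0.18215 latent scaling factor, and "trailing" DDIM timestep spacing so that the single denoising step is placed at the final (pure-noise) training timestep. All of these details are assumptions consistent with, but not stated in, the abstract.

```python
# Illustrative single-step image-conditional depth inference (not the paper's code).
import torch
from diffusers import DDIMScheduler

@torch.no_grad()
def single_step_depth(unet, vae, image, text_embed):
    """image: (B, 3, H, W) in [-1, 1]; text_embed: encoded (empty) prompt."""
    # Encode the conditioning image into latent space (assumed SD scaling factor).
    rgb_latent = vae.encode(image).latent_dist.mode() * 0.18215

    # DDIM with a single step; "trailing" spacing places that step at the last
    # training timestep, so the timestep matches the pure-noise input.
    scheduler = DDIMScheduler.from_pretrained(
        "stabilityai/stable-diffusion-2",  # assumed base model
        subfolder="scheduler",
        timestep_spacing="trailing",
    )
    scheduler.set_timesteps(1)
    t = scheduler.timesteps[0]

    # Start the depth latent from Gaussian noise; a fixed seed (or the noise
    # distribution's mean) makes the one-step prediction deterministic.
    depth_latent = torch.randn_like(rgb_latent)

    # One UNet evaluation on the concatenated latents, one scheduler step to
    # obtain the clean-sample estimate directly.
    model_in = torch.cat([rgb_latent, depth_latent], dim=1)
    model_pred = unet(model_in, t, encoder_hidden_states=text_embed).sample
    depth_latent = scheduler.step(model_pred, t, depth_latent).prev_sample

    # Decode and average channels to obtain a single-channel depth map.
    depth = vae.decode(depth_latent / 0.18215).sample
    return depth.mean(dim=1, keepdim=True)
```

The abstract also mentions end-to-end fine-tuning with "task-specific losses" without naming them. For affine-invariant depth, a common choice is a scale- and shift-invariant loss in the spirit of MiDaS; the version below is an illustrative assumption, not necessarily the exact loss used in the paper.

```python
import torch

def scale_shift_invariant_loss(pred, target, mask):
    """Affine-invariant depth loss: align each prediction to its target with a
    per-image least-squares scale and shift, then take a masked L1 (illustrative)."""
    losses = []
    for p, t, m in zip(pred, target, mask):
        p, t = p[m], t[m]
        # Solve min_{s,b} ||s * p + b - t||^2 in closed form.
        A = torch.stack([p, torch.ones_like(p)], dim=1)       # (N, 2)
        sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution  # (2, 1)
        s, b = sol[0], sol[1]
        losses.append((s * p + b - t).abs().mean())
    return torch.stack(losses).mean()
```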