Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
September 17, 2024
Authors: Gonzalo Martin Garcia, Karim Abou Zeid, Christian Schmidt, Daan de Geus, Alexander Hermans, Bastian Leibe
cs.AI
Abstract
Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed. The fixed model performs comparably to the best previously reported configuration while being more than 200× faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. We surprisingly find that this fine-tuning protocol also works directly on Stable Diffusion and achieves comparable performance to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior works.
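
The abstract refers to single-step inference with an image-conditional latent diffusion model of the Marigold family, but does not spell out the setup. The sketch below is a rough illustration of such a pipeline, not the authors' released code: it assumes a Stable-Diffusion-style VAE and a UNet that takes the RGB latent concatenated with a depth latent, the usual 0.18215 latent scaling factor, and "trailing" DDIM timestep spacing so that the single denoising step is placed at the final (pure-noise) training timestep. All of these details are assumptions consistent with, but not stated in, the abstract.

```python
# Illustrative single-step image-conditional depth inference (not the paper's code).
import torch
from diffusers import DDIMScheduler

@torch.no_grad()
def single_step_depth(unet, vae, image, text_embed):
    """image: (B, 3, H, W) in [-1, 1]; text_embed: encoded (empty) prompt."""
    # Encode the conditioning image into latent space (assumed SD scaling factor).
    rgb_latent = vae.encode(image).latent_dist.mode() * 0.18215

    # DDIM with a single step; "trailing" spacing places that step at the last
    # training timestep, so the timestep matches the pure-noise input.
    scheduler = DDIMScheduler.from_pretrained(
        "stabilityai/stable-diffusion-2",  # assumed base model
        subfolder="scheduler",
        timestep_spacing="trailing",
    )
    scheduler.set_timesteps(1)
    t = scheduler.timesteps[0]

    # Start the depth latent from Gaussian noise; a fixed seed (or the noise
    # distribution's mean) makes the one-step prediction deterministic.
    depth_latent = torch.randn_like(rgb_latent)

    # One UNet evaluation on the concatenated latents, one scheduler step to
    # obtain the clean-sample estimate directly.
    model_in = torch.cat([rgb_latent, depth_latent], dim=1)
    model_pred = unet(model_in, t, encoder_hidden_states=text_embed).sample
    depth_latent = scheduler.step(model_pred, t, depth_latent).prev_sample

    # Decode and average channels to obtain a single-channel depth map.
    depth = vae.decode(depth_latent / 0.18215).sample
    return depth.mean(dim=1, keepdim=True)
```

The abstract also mentions end-to-end fine-tuning with "task-specific losses" without naming them. For affine-invariant depth, a common choice is a scale- and shift-invariant loss in the spirit of MiDaS; the version below is an illustrative assumption, not necessarily the exact loss used in the paper.

```python
import torch

def scale_shift_invariant_loss(pred, target, mask):
    """Affine-invariant depth loss: align each prediction to its target with a
    per-image least-squares scale and shift, then take a masked L1 (illustrative)."""
    losses = []
    for p, t, m in zip(pred, target, mask):
        p, t = p[m], t[m]
        # Solve min_{s,b} ||s * p + b - t||^2 in closed form.
        A = torch.stack([p, torch.ones_like(p)], dim=1)       # (N, 2)
        sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution  # (2, 1)
        s, b = sol[0], sol[1]
        losses.append((s * p + b - t).abs().mean())
    return torch.stack(losses).mean()
```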