Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think

Abstract

Recent work showed that large diffusion models can be reused as highly precise monocular depth estimators by casting depth estimation as an image-conditional image generation task. While the proposed model achieved state-of-the-art results, high computational demands due to multi-step inference limited its use in many scenarios. In this paper, we show that the perceived inefficiency was caused by a previously unnoticed flaw in the inference pipeline. The fixed model performs comparably to the best previously reported configuration while being more than 200 times faster. To optimize for downstream task performance, we perform end-to-end fine-tuning on top of the single-step model with task-specific losses and obtain a deterministic model that outperforms all other diffusion-based depth and normal estimation models on common zero-shot benchmarks. Surprisingly, we find that this fine-tuning protocol also works directly on Stable Diffusion and achieves performance comparable to current state-of-the-art diffusion-based depth and normal estimation models, calling into question some of the conclusions drawn from prior work.

