FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Abstract

Diffusion models are proficient at generating high-quality images. However, they are effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions, and retraining at higher resolutions quickly becomes prohibitive. Methods that enable pre-existing diffusion models to operate at flexible test-time resolutions are therefore highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve global structure consistency, and an Attention Modulation (AM) module that improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined FAM diffusion, integrates seamlessly into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Moreover, our method avoids redundant inference tricks for improved consistency, such as patch-based or progressive generation, and therefore adds negligible latency overhead.
