Training-free Diffusion Acceleration with Bottleneck Sampling

Abstract

Diffusion models have demonstrated remarkable capabilities in visual content generation, but their high computational cost at inference time makes them difficult to deploy. This cost arises primarily from the quadratic complexity of self-attention with respect to image or video resolution. Existing acceleration methods often compromise output quality or require costly retraining. We observe, however, that most diffusion models are pre-trained at lower resolutions, which presents an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it performs high-resolution denoising in the initial and final stages and operates at lower resolutions in the intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks. Extensive experiments demonstrate that it accelerates inference by up to 3× for image generation and up to 2.5× for video generation, while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics. Code is available at: https://github.com/tyfeld/Bottleneck-Sampling
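
To make the high-low-high workflow and the quadratic attention cost concrete, the following is a minimal sketch, not the authors' implementation. The stage lengths (4, 12, 4), the 0.5 downscaling factor, the 1024×1024 target size, and the names Stage, resolution_schedule, and relative_attention_cost are all hypothetical choices made for illustration. It assumes only that self-attention cost per step grows with the square of the token count and that the token count is proportional to height × width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One segment of the denoising schedule (all values are illustrative)."""
    steps: int     # number of denoising steps run in this stage
    scale: float   # side-length scale relative to the target resolution


def resolution_schedule(stages, height, width):
    """Expand the stages into a per-step list of (height, width) pairs."""
    schedule = []
    for stage in stages:
        size = (int(height * stage.scale), int(width * stage.scale))
        schedule.extend([size] * stage.steps)
    return schedule


def relative_attention_cost(schedule, height, width):
    """Self-attention cost relative to running every step at full resolution.

    Assumption: cost per step is proportional to (token count)^2, and the
    token count is proportional to height * width.
    """
    full_step_cost = (height * width) ** 2
    total = sum((h * w) ** 2 for h, w in schedule)
    return total / (full_step_cost * len(schedule))


# Hypothetical high-low-high split: 4 full-resolution steps,
# 12 half-resolution steps, then 4 full-resolution steps.
stages = [Stage(4, 1.0), Stage(12, 0.5), Stage(4, 1.0)]
schedule = resolution_schedule(stages, 1024, 1024)
cost = relative_attention_cost(schedule, 1024, 1024)
print(f"{len(schedule)} steps, relative attention cost = {cost:.4f}")
```

Under these assumed values, halving the side length cuts the token count to a quarter and the per-step attention cost to one sixteenth. The estimate covers the attention term only, so it should not be read as a prediction of the end-to-end speedups reported in the abstract.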
