Dynamic Chunking Diffusion Transformer

Abstract

Diffusion Transformers (DiTs) process images as fixed-length sequences of tokens produced by a static patchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold. The scaffold adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner, using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256×256, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines at 4× and 16× compression, which suggests that it is a promising technique with potential further applications to pixel-space, video, and 3D generation. Beyond generation quality, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8× fewer training steps), and it composes with other dynamic computation methods to further reduce generation FLOPs.
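To make the idea of data-dependent chunking concrete, the following is a minimal toy sketch, not the DC-DiT implementation. The abstract does not specify the routing rule, so the sketch substitutes a fixed cosine-similarity threshold between adjacent tokens for the learned router, operates on a 1D token list rather than a 2D input, and uses mean pooling and broadcast as stand-ins for the learned encoder and decoder. All function names (`route`, `compress`, `decompress`) and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def route(tokens, threshold=0.9):
    """Assign a chunk id to each token (assumes a non-empty sequence).

    Stand-in for a learned router: a new chunk starts wherever the
    similarity between adjacent tokens drops below `threshold`.
    """
    chunk_ids = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        starts_new_chunk = cosine(prev, cur) < threshold
        chunk_ids.append(chunk_ids[-1] + (1 if starts_new_chunk else 0))
    return chunk_ids


def compress(tokens, chunk_ids):
    """Mean-pool the tokens of each chunk into one vector per chunk."""
    n_chunks = chunk_ids[-1] + 1
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(n_chunks)]
    counts = [0] * n_chunks
    for tok, cid in zip(tokens, chunk_ids):
        counts[cid] += 1
        for j, value in enumerate(tok):
            sums[cid][j] += value
    return [[s / counts[c] for s in sums[c]] for c in range(n_chunks)]


def decompress(chunks, chunk_ids):
    """Broadcast each chunk vector back to the positions it covers."""
    return [list(chunks[cid]) for cid in chunk_ids]


# Five identical "background" tokens followed by three distinct "detail" tokens.
tokens = [[1.0, 0.0]] * 5 + [[0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
chunk_ids = route(tokens)
short_sequence = compress(tokens, chunk_ids)   # what a backbone would process
restored = decompress(short_sequence, chunk_ids)
print(len(tokens), "tokens ->", len(short_sequence), "chunks")
```

In this toy input the uniform run collapses into a single chunk while each distinct token keeps its own, so the backbone would see a shorter sequence whose length depends on the content. In the setting the abstract describes, the chunking mechanism is instead learned end-to-end with diffusion training.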