PixNerd: Pixel Neural Field Diffusion

Abstract

The current success of diffusion transformers depends heavily on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to these efforts, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, which we call pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we achieve 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 directly, without any complex cascade pipeline or VAE. We also extend the PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive overall score of 0.73 on the GenEval benchmark and 80.9 on the DPG benchmark.

