Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion

Abstract

In this paper, we study the diffusability (learnability) of variational autoencoders (VAEs) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that explains prior observations of over-noisy or over-smoothed latents, and interprets several recent methods (e.g., VA-VAE, EQ-VAE) as special cases. Experiments suggest that Spectrum Matching yields superior diffusion generation on the CelebA and ImageNet datasets and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at https://github.com/forever208/SpectrumMatching.
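To make the notion of a power-law PSD concrete, the following is a minimal sketch, not the authors' implementation, of how a radially averaged PSD and its log-log slope could be measured for a single-channel image or latent map. The function names (`radial_psd`, `power_law_slope`, `synthetic_power_law`) and the integer-radius binning are assumptions made for illustration only; the linked repository may compute and match spectra differently.

```python
import numpy as np


def radial_psd(x):
    """Radially averaged power spectral density of a 2D array of shape (H, W)."""
    h, w = x.shape
    # 2D power spectrum with the zero frequency shifted to the centre.
    power = np.abs(np.fft.fftshift(np.fft.fft2(x))) ** 2
    # Integer radial frequency index of every FFT bin.
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    # Mean power inside each integer-radius ring.
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.maximum(np.bincount(r.ravel()), 1)
    psd = sums / counts
    # Drop the DC term and everything beyond the shorter-axis Nyquist radius.
    return psd[1 : min(h, w) // 2]


def power_law_slope(psd):
    """Least-squares slope of log(PSD) against log(radial frequency)."""
    freqs = np.arange(1, len(psd) + 1)
    slope, _intercept = np.polyfit(np.log(freqs), np.log(psd), 1)
    return slope


def synthetic_power_law(size, alpha, seed=0):
    """Random test image whose PSD falls off roughly as f ** (-alpha)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((size, size))
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.hypot(fy, fx)
    f[0, 0] = 1.0  # avoid division by zero at the DC bin
    # Scaling the amplitude by f ** (-alpha / 2) scales the power by f ** (-alpha).
    spectrum = np.fft.fft2(noise) * f ** (-alpha / 2)
    return np.real(np.fft.ifft2(spectrum))


if __name__ == "__main__":
    white = np.random.default_rng(1).standard_normal((128, 128))
    image_like = synthetic_power_law(128, alpha=2.0)
    print("white-noise slope:", power_law_slope(radial_psd(white)))
    print("power-law slope:  ", power_law_slope(radial_psd(image_like)))
```

Under these assumptions, white noise gives a slope near zero (a flat spectrum), while the synthetic power-law image gives a clearly negative slope. Comparing the slope measured on images with the slope measured on their latents is one simple way to inspect how closely the two spectra agree.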
