Spectrum Matching: a Unified Perspective for Superior Diffusability in Latent Diffusion

Abstract

In this paper, we study the diffusability (learnability) of variational autoencoders (VAE) in latent diffusion. First, we show that pixel-space diffusion trained with an MSE objective is inherently biased toward learning low and mid spatial frequencies, and that the power-law power spectral density (PSD) of natural images makes this bias perceptually beneficial. Motivated by this result, we propose the Spectrum Matching Hypothesis: latents with superior diffusability should (i) follow a flattened power-law PSD (Encoding Spectrum Matching, ESM) and (ii) preserve frequency-to-frequency semantic correspondence through the decoder (Decoding Spectrum Matching, DSM). In practice, we apply ESM by matching the PSD between images and latents, and DSM via shared spectral masking with frequency-aligned reconstruction. Importantly, Spectrum Matching provides a unified view that clarifies prior observations of over-noisy or over-smoothed latents, and interprets several recent methods (e.g., VA-VAE, EQ-VAE) as special cases. Experiments suggest that Spectrum Matching yields superior diffusion generation on the CelebA and ImageNet datasets, and outperforms prior approaches. Finally, we extend the spectral view to representation alignment (REPA): we show that the directional spectral energy of the target representation is crucial for REPA, and propose a DoG-based method to further improve the performance of REPA. Our code is available at https://github.com/forever208/SpectrumMatching.
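
As a rough illustration of the PSD comparison that ESM relies on, the sketch below computes a radially averaged PSD for a single-channel 2D array, fits a power-law slope in log-log space, and measures how far apart the spectral shapes of an image and a latent are. It is not the authors' implementation (the linked repository is the reference for that). The function names `radial_psd`, `spectral_slope`, `psd_matching_loss`, and `power_law_noise`, the bin count, and the choice of comparing mean-centred log-PSD curves are all assumptions made for illustration.

```python
import numpy as np


def radial_psd(x, n_bins=8):
    """Radially averaged power spectral density of a 2D array of shape (H, W).

    Assumption: n_bins is small enough (at most min(H, W) / 2) that every
    frequency ring contains at least one FFT bin.
    """
    h, w = x.shape
    # 2D power spectrum with the zero frequency shifted to the centre
    power = np.abs(np.fft.fftshift(np.fft.fft2(x))) ** 2 / (h * w)
    # Radial frequency (cycles per sample) of every FFT bin
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # Average power inside concentric rings up to Nyquist (0.5), skipping DC
    edges = np.linspace(0.0, 0.5, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    psd = np.empty(n_bins)
    for i in range(n_bins):
        ring = (radius > edges[i]) & (radius <= edges[i + 1])
        psd[i] = power[ring].mean()
    return centers, psd


def spectral_slope(x, n_bins=8):
    """Slope of log-PSD versus log-frequency; about -alpha for a 1/f**alpha PSD."""
    centers, psd = radial_psd(x, n_bins)
    slope, _ = np.polyfit(np.log(centers), np.log(psd), 1)
    return float(slope)


def psd_matching_loss(image, latent, n_bins=8, eps=1e-12):
    """Mean squared gap between the mean-centred log-PSD curves of two arrays.

    Centring removes overall scale, so only the spectral shape is compared.
    Hypothetical loss for illustration, not the loss used in the paper.
    """
    _, p_img = radial_psd(image, n_bins)
    _, p_lat = radial_psd(latent, n_bins)
    log_img = np.log(p_img + eps)
    log_lat = np.log(p_lat + eps)
    gap = (log_img - log_img.mean()) - (log_lat - log_lat.mean())
    return float(np.mean(gap ** 2))


def power_law_noise(size, alpha, seed=0):
    """Synthetic square field whose PSD falls off roughly as 1 / f**alpha."""
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0  # avoid dividing by zero at the DC component
    spectrum = np.fft.fft2(rng.standard_normal((size, size)))
    return np.real(np.fft.ifft2(spectrum / radius ** (alpha / 2)))
```

Under these assumptions, a synthetic "image" from `power_law_noise(64, 2.0)` and a smaller "latent" from `power_law_noise(32, 2.0)` should give a small `psd_matching_loss`, while a white-noise latent (`alpha=0.0`), loosely analogous to an over-noisy latent, should give a much larger one. Because the PSD is binned in normalised frequency, arrays of different spatial sizes can be compared directly.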
