Show the Signal, Hide the Noise: Spectral Forcing in Pixel-Space Diffusion

Abstract

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser depends strongly on frequency. Under rectified-flow diffusion and natural-image power-law spectra with exponent α, the per-band data-to-noise contour k^{*}(t) = (1-t)^{-2/α} separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time t. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally, and it can spend computation on frequency-time regions where the optimal prediction collapses to a deterministic baseline instead of modeling the data distribution. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with diffusion time, and the operator becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization, and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across training epochs; at finer tokenization, it remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval scores, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion: show the signal and hide the noise.
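
To make the operator described above concrete, here is a minimal NumPy sketch of a time-conditional 2D-DCT low-pass filter. It is an illustration under stated assumptions, not the authors' implementation: the function names (`dct_matrix`, `cutoff`, `spectral_forcing`), the default α = 2, the use of a square (per-axis) mask, and the direct mapping of k^{*}(t) to a DCT band count are all hypothetical choices. The only parts taken from the abstract are the contour k^{*}(t) = (1-t)^{-2/α}, the monotone cutoff, and the identity at the data endpoint t = 1.

```python
import math
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C of shape (n, n), so that C @ C.T == I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return c


def cutoff(t: float, alpha: float, n: int) -> int:
    """Number of DCT bands kept per axis at time t (t = 1 is the data endpoint).

    Assumption: k*(t) = (1 - t)^(-2 / alpha) is read directly as a band count.
    """
    # k*(t) >= n is equivalent to (1 - t) <= n^(-alpha / 2); keep every band.
    # This branch also covers t = 1 and avoids overflow in the power below.
    if 1.0 - t <= n ** (-alpha / 2.0):
        return n
    return min(n, max(1, math.ceil((1.0 - t) ** (-2.0 / alpha))))


def spectral_forcing(x_t: np.ndarray, t: float, alpha: float = 2.0) -> np.ndarray:
    """Low-pass the noisy input x_t of shape (..., H, W) in the 2D-DCT domain."""
    h, w = x_t.shape[-2:]
    ch, cw = dct_matrix(h), dct_matrix(w)
    coeffs = ch @ x_t @ cw.T  # forward 2D DCT over the last two axes
    mask = np.zeros((h, w))
    mask[: cutoff(t, alpha, h), : cutoff(t, alpha, w)] = 1.0  # keep low bands
    return ch.T @ (coeffs * mask) @ cw  # inverse 2D DCT
```

In this sketch the filtered array would be passed to the patch embedder in place of the raw noisy input. With α = 2 and a 32-band axis, `cutoff` keeps 1 band at t = 0, 2 bands at t = 0.5, 4 bands at t = 0.75, and all 32 bands at t = 1, where the operator reduces to the identity.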