Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Abstract

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally, and it can spend computation on frequency-time regions where the optimal prediction collapses to a deterministic baseline rather than requiring any modeling of the data distribution. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with diffusion time and the operator becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization, and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across training epochs; at finer tokenization, it remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
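To make the operator concrete, the following is a minimal NumPy sketch of a time-conditional 2D-DCT low-pass filter, written under several assumptions that the abstract does not fix. The interpolation convention `x_t = t * x + (1 - t) * eps` (data at `t = 1`), the value `ALPHA = 2.0`, the radial mask in DCT index units, and the clipping of the cutoff so that every band passes at `t = 1` are all illustrative choices, and the function names `dct_matrix`, `cutoff`, and `spectral_forcing` are hypothetical. Only the contour $k^{*}(t) = (1-t)^{-2/\alpha}$, the monotone cutoff, and the identity at the data endpoint come from the text above.

```python
import numpy as np

ALPHA = 2.0  # assumed power-law exponent of the image spectrum (illustrative value)


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] /= np.sqrt(2.0)  # rescale the DC row so the matrix is orthogonal
    return mat


def cutoff(t: float, n: int, alpha: float = ALPHA) -> float:
    """Assumed schedule: k*(t) = (1 - t)^(-2/alpha), clipped so all bands pass at t = 1."""
    k_full = np.sqrt(2.0) * n  # larger than any radial index on an n x n grid
    if t >= 1.0:
        return k_full
    return min((1.0 - t) ** (-2.0 / alpha), k_full)


def spectral_forcing(x_t: np.ndarray, t: float, alpha: float = ALPHA) -> np.ndarray:
    """Low-pass a noisy (H, W) image in 2D-DCT space with a time-dependent cutoff."""
    h, w = x_t.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    coeffs = dh @ x_t @ dw.T  # forward 2D DCT
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    radius = np.sqrt(u**2 + v**2)  # radial frequency index (an assumed mask shape)
    mask = radius <= cutoff(t, max(h, w), alpha)
    return dh.T @ (coeffs * mask) @ dw  # inverse 2D DCT of the masked coefficients


# Usage: report the fraction of input energy that survives the filter at a few times.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))    # stand-in for a clean image
eps = rng.standard_normal((32, 32))  # Gaussian noise
for t in (0.0, 0.5, 0.9, 1.0):
    x_t = t * x + (1.0 - t) * eps    # assumed rectified-flow interpolation
    y = spectral_forcing(x_t, t)
    kept = np.linalg.norm(y) ** 2 / np.linalg.norm(x_t) ** 2
    print(f"t={t:.1f}  energy kept: {kept:.3f}")
```

In a denoiser, the filtered `y` would be what the patch embedder sees in place of `x_t`. Because the mask is a fixed function of `t`, the sketch adds no learnable parameters, which matches the parameter-free description above.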