Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Abstract

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra with exponent \(\alpha\), the per-band data-to-noise contour \(k^{*}(t) = (1-t)^{-2/\alpha}\) separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time \(t\). We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally, and it can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across training epochs, indicating that the gains persist throughout training; at finer tokenization, Spectral Forcing remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
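
To make the operator concrete, the following is a minimal NumPy sketch of a time-conditional 2D-DCT low-pass filter in the spirit of the description above. It is an illustration under stated assumptions, not the paper's implementation: the function names (`dct_matrix`, `cutoff`, `spectral_forcing`), the square (per-axis) mask, the default `alpha=2.0`, the choice to tie the cutoff directly to \(k^{*}(t)\) with clipping to the image size, and the convention that \(t = 1\) is the data endpoint are all hypothetical choices made here.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C: C @ x is the DCT of x, and C.T @ C = I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)    # rescale the DC row for orthonormality
    return c


def cutoff(t: float, n: int, alpha: float = 2.0) -> int:
    """Assumed schedule: keep ceil(k*(t)) bands per axis, k*(t) = (1-t)^(-2/alpha).

    The value is clipped to [1, n]; at t = 1 (assumed data endpoint) all n
    bands are kept, so the filter reduces to the identity.
    """
    if t >= 1.0:
        return n
    k_star = (1.0 - t) ** (-2.0 / alpha)
    return int(min(n, max(1, np.ceil(k_star))))


def spectral_forcing(x_t: np.ndarray, t: float, alpha: float = 2.0) -> np.ndarray:
    """Low-pass x_t (shape (..., H, W)) in the 2D-DCT domain; no learned parameters."""
    h, w = x_t.shape[-2:]
    c_h, c_w = dct_matrix(h), dct_matrix(w)
    coeffs = c_h @ x_t @ c_w.T                     # forward 2D DCT
    mask = np.zeros((h, w))
    mask[: cutoff(t, h, alpha), : cutoff(t, w, alpha)] = 1.0  # assumed square mask
    return c_h.T @ (coeffs * mask) @ c_w           # inverse 2D DCT


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(3, 32, 32))               # stand-in for a noisy input
    for t in (0.0, 0.5, 0.9, 1.0):
        y = spectral_forcing(x, t)
        print(t, cutoff(t, 32), float(np.abs(y - x).max()))
```

In this sketch the filtered array would be passed to the patch embedder in place of the raw noisy input. At \(t = 0\) only the DC coefficient survives, so each channel collapses to its mean, and at \(t = 1\) the input passes through unchanged.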