Dynamic Chunking Diffusion Transformer

Abstract

Diffusion Transformers process images as fixed-length sequences of tokens produced by a static patchify operation. While effective, this design spends the same compute on low-information and high-information regions, ignoring two facts: images contain regions of varying detail, and the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold. The scaffold adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner, using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, and meaningful visual segmentations emerge without explicit supervision. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256×256, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines at 4× and 16× compression, which suggests the technique is promising for pixel-space, video, and 3D generation. Beyond generation quality, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8× fewer training steps), and it composes with other dynamic computation methods to further reduce generation FLOPs.
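
The abstract does not say how the router decides where one chunk ends and the next begins, so the following is only a toy illustration of the general idea of data-dependent chunking, not the DC-DiT method. It assumes a flattened 1D token sequence and a hand-written rule that opens a new chunk whenever a token's cosine similarity to its predecessor falls below a threshold. The names `route`, `compress`, `expand`, and `threshold` are hypothetical; in the model described above the routing is learned end-to-end rather than fixed.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def route(tokens, threshold=0.9):
    """Hypothetical router: flag a chunk boundary where a token differs from its predecessor."""
    flags = [True]  # the first token always opens a chunk
    for prev, cur in zip(tokens, tokens[1:]):
        flags.append(cosine(prev, cur) < threshold)
    return flags


def compress(tokens, flags):
    """Mean-pool each run of tokens between boundaries into a single token."""
    chunks = []
    for tok, is_boundary in zip(tokens, flags):
        if is_boundary:
            chunks.append([])
        chunks[-1].append(tok)
    return [[sum(col) / len(chunk) for col in zip(*chunk)] for chunk in chunks]


def expand(pooled, flags):
    """Decoder-side counterpart: copy each pooled token back to every position in its chunk."""
    out, idx = [], -1
    for is_boundary in flags:
        if is_boundary:
            idx += 1
        out.append(pooled[idx])
    return out


# Toy sequence: six identical "background" tokens followed by four distinct "detail" tokens.
background = [[1.0, 0.0, 0.0]] * 6
detail = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
tokens = background + detail

flags = route(tokens)
pooled = compress(tokens, flags)   # shorter sequence a backbone would process
restored = expand(pooled, flags)   # back to the original sequence length

print(len(tokens), "tokens ->", len(pooled), "chunks")
```

In this toy input the six background tokens collapse into one chunk while each detail token keeps its own, so the uniform region costs far fewer tokens than the detailed one. A learned router would replace the fixed threshold, and could additionally condition on the diffusion timestep to use fewer chunks at noisy stages.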