Dynamic Chunking Diffusion Transformer

Abstract

Diffusion Transformers process images as fixed-length token sequences produced by a static patchify operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring two facts: images contain regions of varying detail, and denoising progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold. The scaffold adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner, using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, and meaningful visual segmentations emerge without explicit supervision. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet 256×256, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines at 4× and 16× compression, which suggests the technique is promising and may extend to pixel-space, video, and 3D generation. Beyond accuracy, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to 8× fewer training steps), and it composes with other dynamic computation methods to further reduce generation FLOPs.
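To make the idea of data-dependent compression concrete, the following is a minimal sketch and not the DC-DiT implementation. The abstract does not say how the router picks chunk boundaries or how tokens are merged, so everything here is an illustrative assumption: the hypothetical `chunk_tokens` starts a new chunk wherever a per-token boundary score crosses a threshold and mean-pools each chunk, and the hypothetical `expand` copies each pooled vector back to its original positions as a stand-in for the decoder. The sketch also works on a flat 1D sequence rather than a 2D grid.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def chunk_tokens(
    tokens: Sequence[Vector],
    boundary_scores: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[List[Vector], List[int]]:
    """Group consecutive tokens into chunks and mean-pool each chunk.

    Assumption: a token opens a new chunk when its score >= threshold.
    In a learned model the scores would come from a router network.
    """
    if len(tokens) != len(boundary_scores):
        raise ValueError("tokens and boundary_scores must have equal length")

    chunks: List[List[Vector]] = []
    for token, score in zip(tokens, boundary_scores):
        # The first token always opens a chunk.
        if not chunks or score >= threshold:
            chunks.append([])
        chunks[-1].append(token)

    # Mean-pool every chunk, one feature dimension at a time.
    pooled = [[sum(dim) / len(chunk) for dim in zip(*chunk)] for chunk in chunks]
    sizes = [len(chunk) for chunk in chunks]
    return pooled, sizes


def expand(pooled: Sequence[Vector], sizes: Sequence[int]) -> List[Vector]:
    """Decoder stand-in: repeat each pooled vector over its original span."""
    return [vec for vec, n in zip(pooled, sizes) for _ in range(n)]


# Five near-identical "background" tokens followed by three "detail" tokens.
tokens = [[1.0, 1.0]] * 5 + [[2.0, 0.0], [0.0, 4.0], [6.0, 2.0]]
# Hypothetical router output: low scores inside the uniform region.
scores = [1.0, 0.1, 0.1, 0.1, 0.1, 0.9, 0.8, 0.9]

pooled, sizes = chunk_tokens(tokens, scores)
restored = expand(pooled, sizes)
print(sizes)                      # chunk lengths
print(len(tokens) / len(pooled))  # compression ratio for this input
```

In this toy input the uniform region collapses into one chunk while each detail token keeps its own, so the backbone would see 4 tokens instead of 8. Lowering or raising the scores across the sequence changes the number of chunks, which is one way to picture compression that varies with image content or timestep.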