MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Abstract

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
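The token-subset idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, the sampling scheme, or the mixing block, so the code assumes a negative-cosine-similarity alignment loss, uniform sampling without replacement, and a hypothetical `keep_ratio` parameter. The projection head and the pre-mask token mixing block are omitted, and all function names are illustrative.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sample_token_subset(num_tokens, keep_ratio, rng):
    """Uniformly sample token indices without replacement.

    keep_ratio is a hypothetical hyperparameter, not taken from the paper.
    """
    num_keep = max(1, round(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def subset_alignment_loss(diffusion_tokens, encoder_tokens, keep_ratio, rng):
    """Mean negative cosine similarity over a randomly sampled token subset.

    diffusion_tokens: per-token features from the diffusion model (assumed
                      already projected to the encoder's feature dimension).
    encoder_tokens:   per-token clean-image features from the vision encoder.
    """
    assert len(diffusion_tokens) == len(encoder_tokens)
    kept = sample_token_subset(len(diffusion_tokens), keep_ratio, rng)
    total = sum(
        cosine_similarity(diffusion_tokens[i], encoder_tokens[i]) for i in kept
    )
    return -total / len(kept), kept


if __name__ == "__main__":
    rng = random.Random(0)
    # 16 tokens with 8-dimensional features, random stand-ins for real features.
    encoder = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
    diffusion = [[x + 0.1 * rng.gauss(0, 1) for x in tok] for tok in encoder]
    loss, kept = subset_alignment_loss(diffusion, encoder, keep_ratio=0.5, rng=rng)
    print(f"aligned on {len(kept)} of 16 tokens, loss = {loss:.4f}")
```

Because a new subset is drawn on every call, each training iteration would align a different set of tokens, which is the property the abstract attributes to MaskAlign. Setting `keep_ratio=1.0` recovers full-token alignment in this sketch.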