MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

Abstract

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
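As a rough illustration of the token-subset idea, the sketch below computes an alignment loss over a randomly sampled subset of token positions rather than over all tokens. It is a minimal sketch under stated assumptions, not the authors' implementation: the negative cosine similarity loss, the uniform sampling, the `keep_ratio` parameter, and all function names are hypothetical, and the features are assumed to be already projected to a common dimension. The pre-mask token mixing block is omitted because its form is not described here, and the sketch only restricts the loss to the subset; whether dropped tokens are also removed from the forward pass is left open.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sample_token_subset(num_tokens, keep_ratio, rng):
    """Uniformly sample token indices without replacement (assumed scheme)."""
    num_keep = max(1, int(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def subset_alignment_loss(diffusion_feats, encoder_feats, keep_idx):
    """Mean negative cosine similarity over the kept tokens only.

    diffusion_feats: per-token features from the diffusion model (noisy input).
    encoder_feats:   per-token reference features from the clean image.
    """
    sims = [cosine(diffusion_feats[i], encoder_feats[i]) for i in keep_idx]
    return -sum(sims) / len(sims)


# Toy usage: 16 tokens with 8-dimensional features, half of them aligned.
rng = random.Random(0)
num_tokens, dim = 16, 8
h = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
z = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]

keep_idx = sample_token_subset(num_tokens, keep_ratio=0.5, rng=rng)
loss = subset_alignment_loss(h, z, keep_idx)
```

Calling `sample_token_subset` again at each training iteration yields a different subset, which is how the model would be exposed to varying token subsets over the course of training.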