MaskAlign: Efficient Diffusion Training via Token-Subset Representation Alignment

Abstract

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
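
The token-subset idea described above can be illustrated with a minimal sketch in plain Python. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the keep ratio of 0.5, the mixing weight `alpha`, the use of a (1 - cosine similarity) alignment loss, and the mean-blending stand-in for the pre-mask token mixing block. Masking is interpreted here as restricting the alignment loss to the sampled tokens; the actual method may apply it differently.

```python
import math
import random


def sample_token_subset(num_tokens, keep_ratio, rng):
    """Return sorted indices of a uniformly sampled token subset."""
    num_keep = max(1, round(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def mix_tokens(tokens, alpha):
    """Hypothetical pre-mask mixing: blend each token with the mean token.

    This is only a stand-in for the paper's mixing block, whose actual
    architecture is not specified in the abstract.
    """
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return [[(1.0 - alpha) * tok[d] + alpha * mean[d] for d in range(dim)]
            for tok in tokens]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def subset_alignment_loss(diffusion_tokens, encoder_tokens, keep_idx):
    """Mean (1 - cosine similarity) over the kept tokens only (assumed loss)."""
    losses = [1.0 - cosine_similarity(diffusion_tokens[i], encoder_tokens[i])
              for i in keep_idx]
    return sum(losses) / len(losses)


# Toy data: 16 tokens of dimension 8. The "diffusion" features are a noisy
# copy of the "encoder" features, standing in for real model activations.
rng = random.Random(0)
num_tokens, dim = 16, 8
encoder_tokens = [[rng.gauss(0, 1) for _ in range(dim)]
                  for _ in range(num_tokens)]
diffusion_tokens = [[v + 0.1 * rng.gauss(0, 1) for v in tok]
                    for tok in encoder_tokens]

# One training iteration: mix first, then sample a fresh subset and align
# only on that subset. A new subset would be drawn at every iteration.
mixed = mix_tokens(diffusion_tokens, alpha=0.1)
keep_idx = sample_token_subset(num_tokens, keep_ratio=0.5, rng=rng)
loss = subset_alignment_loss(mixed, encoder_tokens, keep_idx)
```

Mixing before masking means that each kept token still carries some signal from the tokens excluded from the loss, which is the intuition behind mitigating the information loss from dropping tokens. In a real training loop the same steps would run on batched tensors in a deep learning framework.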