WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Abstract

Recent Flow Matching models avoid the reconstruction bottleneck of latent autoencoders by operating directly in pixel space, but the lack of semantic continuity in the pixel manifold severely entangles the optimal transport paths. This causes severe trajectory conflicts near path intersections, which lead to sub-optimal solutions. Rather than sidestepping the problem with lossy latent representations, we untangle the pixel-space trajectories directly by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field through intermediate semantic waypoints projected from pre-trained vision models, splitting the optimal transport into prior-to-waypoint and waypoint-to-pixel segments and thereby disentangling the generation trajectories. Specifically, during iterative denoising, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. The waypoints then continuously condition the primary diffusion transformer through the Just-Pixel AdaLN mechanism, steering the evolution toward the next state and ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT outperforms strong pixel-space baselines and accelerates JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
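The sampling loop described in the abstract (infer a waypoint from the current noisy state, then use it to condition the main network through AdaLN) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class and method names are hypothetical, the weights are random and untrained, a single linear map stands in for the diffusion transformer, the time convention (t = 0 for noise, t = 1 for data) is assumed, and a standard AdaLN scale-and-shift modulation is used in place of the Just-Pixel AdaLN mechanism, whose details the abstract does not specify.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, x):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyWaypointSampler:
    """Hypothetical stand-in for a waypoint-conditioned flow sampler."""

    def __init__(self, dim, waypoint_dim, seed=0):
        rng = random.Random(seed)
        # Lightweight generator: (noisy state, t) -> waypoint.
        self.w_gen = random_matrix(waypoint_dim, dim + 1, rng)
        # AdaLN projections: waypoint -> per-feature scale and shift.
        self.w_scale = random_matrix(dim, waypoint_dim, rng)
        self.w_shift = random_matrix(dim, waypoint_dim, rng)
        # Single linear map standing in for the main transformer.
        self.w_out = random_matrix(dim, dim, rng)

    def infer_waypoint(self, x, t):
        """Infer a waypoint from the current noisy state and time."""
        return [math.tanh(v) for v in linear(self.w_gen, x + [t])]

    def velocity(self, x, waypoint):
        """Predict a velocity, modulating normalized features with the waypoint."""
        scale = linear(self.w_scale, waypoint)
        shift = linear(self.w_shift, waypoint)
        h = layer_norm(x)
        h = [hv * (1.0 + s) + b for hv, s, b in zip(h, scale, shift)]
        return linear(self.w_out, h)

    def sample(self, x0, num_steps=10):
        """Euler integration from t=0 to t=1, re-inferring the waypoint each step."""
        x = list(x0)
        dt = 1.0 / num_steps
        for i in range(num_steps):
            t = i * dt
            waypoint = self.infer_waypoint(x, t)
            v = self.velocity(x, waypoint)
            x = [xv + dt * vv for xv, vv in zip(x, v)]
        return x


rng = random.Random(1)
sampler = ToyWaypointSampler(dim=8, waypoint_dim=4)
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
image = sampler.sample(noise, num_steps=10)
```

The point of the sketch is the control flow rather than the output: the waypoint is recomputed at every step from the current state, so the conditioning signal follows the trajectory instead of being fixed once at the start.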
