

WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

March 16, 2026
Authors: Hainuo Wang, Mingjia Li, Xiaojie Guo
cs.AI

Abstract

While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near path intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models, effectively disentangling the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state and ultimately yielding the final RGB pixels. Evaluated on ImageNet 256×256, WiT surpasses strong pixel-space baselines, accelerating JiT training convergence by 2.2×. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
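The two-segment transport described above can be illustrated with a minimal NumPy sketch of a piecewise-linear interpolant that passes through a semantic waypoint. Note the assumptions: the waypoint is reached at a fixed time `split=0.5` and both legs are straight lines with constant velocity; the paper itself infers waypoints with a learned generator and trains a transformer to regress the field, none of which is shown here.

```python
import numpy as np

def waypoint_path(z, w, x1, t, split=0.5):
    """Piecewise-linear flow-matching path through a waypoint.

    z  : prior (noise) sample
    w  : intermediate semantic waypoint
    x1 : target data point (pixels)
    t  : time in [0, 1]

    Returns the interpolated state x_t and the target velocity v_t.
    `split` (the time at which the path reaches the waypoint) is an
    illustrative assumption, not a value from the paper.
    """
    if t <= split:
        s = t / split                    # progress on the prior-to-waypoint leg
        xt = (1.0 - s) * z + s * w
        vt = (w - z) / split             # constant velocity on this leg
    else:
        s = (t - split) / (1.0 - split)  # progress on the waypoint-to-pixel leg
        xt = (1.0 - s) * w + s * x1
        vt = (x1 - w) / (1.0 - split)
    return xt, vt

# Toy example in a 2-D "pixel space"
z  = np.array([0.0, 0.0])   # prior sample
w  = np.array([1.0, 0.0])   # semantic waypoint
x1 = np.array([1.0, 1.0])   # data point
xt, vt = waypoint_path(z, w, x1, t=0.25)
```

Because the path is forced through `w`, trajectories sharing a waypoint are routed together semantically rather than crossing freely in pixel space, which is the intuition behind breaking transport into the two segments.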
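The abstract says the inferred waypoints condition the diffusion transformer through an AdaLN mechanism. Below is a generic adaptive-LayerNorm sketch in NumPy: features are normalized, then scaled and shifted by parameters regressed from a conditioning vector (here, a hypothetical waypoint embedding `w_emb`). The paper's "Just-Pixel AdaLN" variant is only named in the abstract, so the exact wiring is an assumption; this shows only the standard AdaLN pattern.

```python
import numpy as np

def adaln(h, cond, W_scale, W_shift, eps=1e-5):
    """Adaptive LayerNorm: normalize h, then modulate with a scale and
    shift regressed from the conditioning vector (a waypoint embedding)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    gamma = cond @ W_scale           # per-sample scale from the waypoint
    beta = cond @ W_shift            # per-sample shift from the waypoint
    return (1.0 + gamma) * h_norm + beta

rng = np.random.default_rng(0)
d_model, d_cond = 8, 4
h = rng.standard_normal((2, d_model))      # token features
w_emb = rng.standard_normal((2, d_cond))   # inferred waypoint embeddings
W_scale = np.zeros((d_cond, d_model))      # zero-init: start as identity modulation
W_shift = np.zeros((d_cond, d_model))
out = adaln(h, w_emb, W_scale, W_shift)
```

Zero-initializing the modulation weights makes AdaLN start as plain LayerNorm, a common choice so conditioning influence grows gradually during training.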