

WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

March 16, 2026
Authors: Hainuo Wang, Mingjia Li, Xiaojie Guo
cs.AI

Abstract

While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models, effectively disentangling the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state; they then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state and ultimately yielding the final RGB pixels. Evaluated on ImageNet 256×256, WiT surpasses strong pixel-space baselines and accelerates JiT training convergence by 2.2×. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
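The core mechanism the abstract describes — a lightweight generator inferring a waypoint embedding from the current noisy state, which then modulates the main transformer's features through adaptive layer normalization — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the generator here is a single stand-in linear map, the dimensions are toy-sized, and all weight names (`w_gen`, `w_scale`, `w_shift`) are hypothetical; it only shows the shape of waypoint-conditioned AdaLN modulation inside an iterative denoising loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaln_modulate(x, waypoint, w_scale, w_shift):
    """Adaptive LayerNorm: normalize token features, then scale and
    shift them with parameters regressed from the waypoint embedding."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    x_norm = (x - mu) / (sigma + 1e-6)
    scale = waypoint @ w_scale   # (d_wp,) @ (d_wp, d) -> (d,)
    shift = waypoint @ w_shift
    return x_norm * (1.0 + scale) + shift

# Toy setup: hypothetical dimensions, random stand-in weights.
d, d_wp, n_tokens = 8, 4, 16
w_gen = rng.normal(size=(d, d_wp)) * 0.1      # stand-in waypoint generator
w_scale = rng.normal(size=(d_wp, d)) * 0.1
w_shift = rng.normal(size=(d_wp, d)) * 0.1

x = rng.normal(size=(n_tokens, d))            # noisy pixel tokens
for step in range(4):
    # Lightweight generator: infer a waypoint from the current noisy state.
    waypoint = np.tanh(x.mean(axis=0) @ w_gen)
    # Waypoint conditions the block via AdaLN, steering the next state.
    x = adaln_modulate(x, waypoint, w_scale, w_shift)

print(x.shape)  # feature map retains its (n_tokens, d) shape across steps
```

In the actual model the waypoint generator and modulation weights would be learned jointly with the diffusion transformer, and the waypoints would live in a semantic space projected from a pre-trained vision model rather than a random linear map.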