Two-Step Physics: Locking In Motion Priors Before Visual Refinement Erases Them

Abstract

Image-to-video diffusion models leverage input images to generate visually stunning content, yet they frequently produce motion that violates physical laws. We report a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising: the phase degrades significantly (dropping by approximately 18% from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors of few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it on high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity. It does so with negligible overhead (1.06× time, 1.02× memory) and reduces reliance on expensive external guidance methods (approximately 5× time).
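To make the phase/magnitude distinction concrete, the following minimal sketch decomposes a 1-D signal into Fourier magnitude and phase and compares two signals on each. It is an illustration only, not the measurement used in this work: the function names, the toy trajectory, and both similarity scores are assumptions introduced here. It relies on the Fourier shift theorem, under which displacing a pattern leaves its magnitude spectrum unchanged and alters only its phase, which is why phase is a natural carrier of where and when things move.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real 1-D sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Split the spectrum into per-frequency magnitude and phase (radians)."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def magnitude_similarity(a, b):
    """Cosine similarity between the two magnitude spectra (hypothetical score)."""
    mag_a, _ = magnitude_and_phase(a)
    mag_b, _ = magnitude_and_phase(b)
    dot = sum(x * y for x, y in zip(mag_a, mag_b))
    norm = math.sqrt(sum(x * x for x in mag_a)) * math.sqrt(sum(y * y for y in mag_b))
    return dot / norm


def phase_similarity(a, b):
    """Magnitude-weighted mean cosine of the phase difference (hypothetical score)."""
    mag_a, phase_a = magnitude_and_phase(a)
    mag_b, phase_b = magnitude_and_phase(b)
    weights = [x * y for x, y in zip(mag_a, mag_b)]
    weighted = sum(w * math.cos(pa - pb) for w, pa, pb in zip(weights, phase_a, phase_b))
    return weighted / sum(weights)


# Toy "motion" signal: a bump at one position, and the same bump
# circularly displaced by two samples.
trajectory = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
displaced = trajectory[-2:] + trajectory[:-2]

mag_score = magnitude_similarity(trajectory, displaced)  # stays at 1 up to rounding
phase_score = phase_similarity(trajectory, displaced)    # falls well below 1
```

In this toy case the magnitude score cannot distinguish the two signals at all, while the phase score does. A metric of this general shape, applied to video latents rather than a 1-D list, is one plausible way to observe phase drifting while magnitude stays stable.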