Two-Step Physics: Locking in Motion Priors Before Visual Refinement Erases Them

Abstract

Image-to-video diffusion models leverage input images to generate visually stunning content, yet they frequently produce motion that violates physical laws. We report a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising: the phase degrades significantly, dropping by approximately 18% from step 2 to step 50, whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors of few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it on high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity. It adds negligible overhead (1.06× time, 1.02× memory) and reduces reliance on expensive external guidance methods (approximately 5× time).
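To make the phase-versus-magnitude distinction concrete, the following is a minimal sketch of how two temporal signals can be compared in the frequency domain. It is not the paper's analysis or metric: the function names, the agreement scores, and the toy one-dimensional "latent trajectories" are all assumptions chosen for illustration. It only shows that a signal can keep its magnitude spectrum intact while its phase drifts, which is the kind of change the abstract calls phase erosion.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real-valued 1-D sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def phase_and_magnitude_agreement(reference, candidate, eps=1e-9):
    """Compare two equally long temporal signals in the frequency domain.

    Hypothetical scores (not taken from the paper):
      phase agreement     = magnitude-weighted mean cosine of the phase difference
      magnitude agreement = 1 - relative L1 distance between magnitude spectra
    """
    ref_spec, cand_spec = dft(reference), dft(candidate)
    weights = [abs(r) for r in ref_spec]
    total = sum(weights) + eps
    phase_agreement = sum(
        w * math.cos(cmath.phase(c) - cmath.phase(r))
        for w, r, c in zip(weights, ref_spec, cand_spec)
    ) / total
    mag_diff = sum(abs(abs(c) - abs(r)) for r, c in zip(ref_spec, cand_spec))
    magnitude_agreement = 1.0 - mag_diff / total
    return phase_agreement, magnitude_agreement


# Toy trajectories: one latent value tracked over 16 frames.
frames = 16
reference = [math.sin(2 * math.pi * 2 * t / frames) for t in range(frames)]
# Same oscillation delayed by one frame: magnitudes match, phases do not.
shifted = [math.sin(2 * math.pi * 2 * (t - 1) / frames) for t in range(frames)]

phase_same, mag_same = phase_and_magnitude_agreement(reference, reference)
phase_shift, mag_shift = phase_and_magnitude_agreement(reference, shifted)
print(f"identical: phase={phase_same:.3f}, magnitude={mag_same:.3f}")
print(f"shifted:   phase={phase_shift:.3f}, magnitude={mag_shift:.3f}")
```

In this toy case the delayed signal keeps a magnitude agreement of about 1 while its phase agreement falls to about cos(π/4), so a magnitude-only comparison would miss the timing error entirely. A real analysis would apply a multi-dimensional FFT to video latents; the one-dimensional version here is only meant to isolate the idea.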