Two-Step Physics: Locking Motion Priors Before Visual Refinement Erases Them

Abstract

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising: the phase degrades significantly (dropping by approximately 18% from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead (1.06× time, 1.02× memory) and reduced reliance on expensive external guidance methods (~5× time).
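To make the phase/magnitude distinction concrete, the following is a minimal sketch of how two video-shaped arrays might be compared in the Fourier domain. It is an illustration under stated assumptions, not the paper's actual metric: the function name `phase_and_magnitude_similarity`, the magnitude-weighted cosine used as a phase score, and the cosine similarity used for the amplitude spectra are all hypothetical choices. The toy input is random noise with a circular temporal shift, which changes only the phase of the spectrum.

```python
import numpy as np


def phase_and_magnitude_similarity(ref, test, eps=1e-8):
    """Compare two arrays of shape (T, H, W) in the 3D Fourier domain.

    Hypothetical scores (assumptions, not the paper's definitions):
      phase_sim: magnitude-weighted mean cosine of the phase difference
      mag_sim:   cosine similarity between the two amplitude spectra
    """
    f_ref = np.fft.fftn(ref)
    f_test = np.fft.fftn(test)
    mag_ref, mag_test = np.abs(f_ref), np.abs(f_test)

    # Weight each frequency by its reference amplitude so that
    # near-empty bins (whose phase is mostly noise) contribute little.
    weights = mag_ref / (mag_ref.sum() + eps)
    phase_diff = np.angle(f_test) - np.angle(f_ref)
    phase_sim = float(np.sum(weights * np.cos(phase_diff)))

    # Cosine similarity of the amplitude spectra (flattened 2-norms).
    denom = np.linalg.norm(mag_ref) * np.linalg.norm(mag_test) + eps
    mag_sim = float(np.sum(mag_ref * mag_test) / denom)
    return phase_sim, mag_sim


# Toy data: a circular shift along the time axis leaves every Fourier
# amplitude unchanged and alters only the phase.
rng = np.random.default_rng(0)
reference = rng.standard_normal((8, 16, 16))
shifted = np.roll(reference, shift=3, axis=0)

same_phase, same_mag = phase_and_magnitude_similarity(reference, reference)
shift_phase, shift_mag = phase_and_magnitude_similarity(reference, shifted)
print(f"identical: phase={same_phase:.3f}, magnitude={same_mag:.3f}")
print(f"shifted:   phase={shift_phase:.3f}, magnitude={shift_mag:.3f}")
```

In this toy case the amplitude score stays at its maximum while the phase score drops, which mirrors the qualitative pattern described above: motion information lives largely in the phase, so a spectrum can look unchanged in magnitude while its motion content has degraded.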