2-Step Physics: Locking In Motion Priors Before Visual Refinement Erases Them

Abstract

Image-to-video diffusion models leverage input images to generate visually striking content, yet they frequently produce motion that violates physical laws. We report a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising: the phase degrades significantly (dropping by approximately 18% from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors of few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it on high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity. It adds negligible overhead (1.06× time, 1.02× memory) and reduces reliance on expensive external guidance methods (approximately 5× time).
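The distinction between phase and magnitude drawn above can be illustrated with a small toy example. The sketch below is not the paper's spectral analysis, which operates on model outputs; it assumes a 1D signal standing in for one value tracked across 16 frames. The helper names (`dft`, `magnitude_and_phase`, `phase_agreement`, `magnitude_ratio`) and the magnitude-weighted cosine score are hypothetical choices made for illustration only.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def magnitude_and_phase(signal):
    """Split the spectrum into per-frequency magnitude and phase."""
    spectrum = dft(signal)
    mags = [abs(c) for c in spectrum]
    phases = [cmath.phase(c) for c in spectrum]
    return mags, phases


def phase_agreement(ref, other, eps=1e-9):
    """Magnitude-weighted mean cosine of the per-frequency phase difference.

    A value near 1.0 means the phases match; lower values mean they diverge.
    This scoring rule is an assumption, not the metric used in the paper.
    """
    ref_m, ref_p = magnitude_and_phase(ref)
    _, oth_p = magnitude_and_phase(other)
    weight = sum(ref_m) + eps
    return sum(m * math.cos(a - b) for m, a, b in zip(ref_m, ref_p, oth_p)) / weight


def magnitude_ratio(ref, other, eps=1e-9):
    """Ratio of total spectral magnitude of `other` relative to `ref`."""
    ref_m, _ = magnitude_and_phase(ref)
    oth_m, _ = magnitude_and_phase(other)
    return sum(oth_m) / (sum(ref_m) + eps)


# Toy "motion" signal: two oscillations across 16 frames.
ref = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
# The same oscillation shifted by a quarter period: timing changes, energy does not.
shifted = [math.sin(2 * math.pi * 2 * (t + 2) / 16) for t in range(16)]

print("phase agreement:", phase_agreement(ref, shifted))
print("magnitude ratio:", magnitude_ratio(ref, shifted))
```

In this toy case the shifted signal keeps the same spectral magnitude while its phase agreement with the reference collapses, which is the kind of separation that lets phase degrade while magnitude stays stable.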