Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation

Abstract

Image-to-video (I2V) generation often struggles to maintain high fidelity in open domains. Traditional image animation techniques focus primarily on specific domains such as faces or human poses, which makes them difficult to generalize to open domains. Several recent I2V frameworks based on diffusion models can generate dynamic content for open-domain images but fail to maintain fidelity. We find that the two main causes of low fidelity are the loss of image details and noise prediction biases during the denoising process. To address this, we propose an effective method that can be applied to mainstream video diffusion models. The method achieves high fidelity by supplementing more precise image information and by rectifying the noise. Specifically, given an input image, our method first adds noise to the input image latent to preserve more details, then denoises the noisy latent with proper rectification to alleviate the noise prediction biases. Our method is tuning-free and plug-and-play. Experimental results demonstrate the effectiveness of our approach in improving the fidelity of generated videos. For more image-to-video generation results, please refer to the project website: https://noise-rectification.github.io.
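The two steps named in the abstract (noising the image latent, then denoising with rectification) can be illustrated with a small toy sketch. This is not the authors' implementation. The forward-noising and deterministic DDIM update are the standard formulas, but the blending rule in `rectify`, the `weight` parameter, the linear beta schedule, and the `biased_predictor` stand-in for a video diffusion model are all assumptions made here for illustration; the abstract does not specify the exact form of the rectification.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(z0, eps, alpha_bar):
    """Standard forward diffusion: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def rectify(eps_pred, eps_known, weight):
    """Hypothetical rectification: pull the predicted noise toward the noise
    that was actually added. weight=0 keeps the raw prediction; weight=1
    uses the known noise only."""
    return [weight * k + (1.0 - weight) * p for p, k in zip(eps_pred, eps_known)]


def ddim_step(z_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from step t to the previous step."""
    a_t, b_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    a_p, b_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    x0 = [(z - b_t * e) / a_t for z, e in zip(z_t, eps)]
    return [a_p * x + b_p * e for x, e in zip(x0, eps)]


def biased_predictor(eps_known, bias=0.3):
    """Toy stand-in for a model whose noise prediction carries a constant bias."""
    return [e + bias for e in eps_known]


def denoise(z0, weight, num_steps=50, seed=0):
    """Noise the latent z0, then denoise it with rectified noise predictions."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]      # noise added to the image latent
    z = add_noise(z0, eps, alpha_bars[-1])        # noisy latent at the last step
    for t in range(num_steps - 1, -1, -1):
        eps_pred = biased_predictor(eps)          # biased model output (toy)
        eps_used = rectify(eps_pred, eps, weight)
        prev = alpha_bars[t - 1] if t > 0 else 1.0
        z = ddim_step(z, eps_used, alpha_bars[t], prev)
    return z


def max_abs_error(a, b):
    """Largest element-wise deviation between two latents."""
    return max(abs(x - y) for x, y in zip(a, b))
```

In this toy setting, calling `denoise(z0, weight)` with a larger `weight` returns a latent closer to the original `z0`, because the bias in the predicted noise is increasingly cancelled by the known noise. In a real I2V pipeline a weight of 1 would simply reproduce the still image in every frame, so an intermediate setting would presumably be needed to leave room for motion; how the method balances the two is not stated in the abstract.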

