FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Abstract

Diffusion models have transformed image-to-image (I2I) synthesis and are now permeating into video. However, progress in video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework that jointly leverages spatial conditions and temporal optical flow cues within the source video. In contrast to prior methods that strictly adhere to optical flow, our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow by warping from the first frame and use the result as a supplementary reference in the diffusion model. This lets our model synthesize video by editing the first frame with any prevalent I2I model and then propagating the edits to successive frames. Our V2V model, FlowVid, demonstrates the following properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models, facilitating various modifications, including stylization, object swaps, and local edits. (2) Efficiency: Generating a 4-second video at 30 FPS and 512x512 resolution takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, Rerender, and TokenFlow, respectively. (3) High quality: In user studies, FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender (10.2%), and TokenFlow (40.4%).
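The abstract describes encoding optical flow by warping from the first frame. The sketch below illustrates that general idea only and is not the authors' implementation: it backward-warps a first frame to a later frame using a per-pixel flow field. The function name `warp_from_first_frame`, the flow convention (each target pixel stores the displacement to its source location in the first frame), the nearest-neighbor sampling, and the validity mask are all assumptions made for illustration. Plain Python lists are used to keep the example dependency-free.

```python
def warp_from_first_frame(first_frame, flow):
    """Backward-warp `first_frame` toward a later frame (hypothetical helper).

    first_frame: H x W nested list of pixel values.
    flow: H x W nested list of (dx, dy) tuples. Assumed convention: the
          target pixel (x, y) takes its value from (x + dx, y + dy) in
          the first frame.
    Returns (warped, valid). valid[y][x] is False where the source
    location falls outside the frame, i.e. the flow gives no usable value.
    """
    height = len(first_frame)
    width = len(first_frame[0])
    warped = [[0] * width for _ in range(height)]
    valid = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Nearest-neighbor sampling keeps the sketch short; bilinear
            # interpolation is the more common choice in practice.
            src_x = round(x + dx)
            src_y = round(y + dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = first_frame[src_y][src_x]
                valid[y][x] = True
    return warped, valid


# Toy example: the scene content moves one pixel to the right, so every
# target pixel looks one pixel to the left in the first frame.
first_frame = [[1, 2, 3],
               [4, 5, 6]]
flow = [[(-1, 0)] * 3 for _ in range(2)]
warped, valid = warp_from_first_frame(first_frame, flow)
```

In this toy case the leftmost column has no source pixel, so it is marked invalid. Regions like this, together with errors in the estimated flow itself, are one reason a warped frame is better treated as a supplementary reference than as a hard constraint, which is the stance the abstract takes.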
