Jump Cut Smoothing for Talking Heads

Abstract

A jump cut introduces an abrupt, sometimes unwanted change in the viewing experience. We present a novel framework for smoothing these jump cuts in the context of talking head videos. We leverage the appearance of the subject from other source frames in the video, fusing it with a mid-level representation driven by DensePose keypoints and face landmarks. To achieve motion, we interpolate the keypoints and landmarks between the end frames around the cut. We then use an image translation network that takes the keypoints and source frames as input to synthesize pixels. Because keypoints can contain errors, we propose a cross-modal attention scheme that selects the most appropriate source among multiple options for each keypoint. By leveraging this mid-level representation, our method achieves better results than a strong video interpolation baseline. We demonstrate our method on various jump cuts in talking head videos, such as those created by cutting filler words or pauses, and even random cuts. Our experiments show that we can achieve seamless transitions, even in challenging cases where the talking head rotates or moves drastically across the jump cut.
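
To make the keypoint interpolation step concrete, the following is a minimal sketch of generating in-between keypoint sets for the frames on either side of a cut. The abstract does not state which interpolation scheme is used, so plain linear blending is an assumption here, and the function name `interpolate_keypoints` and the 2D point format are hypothetical choices for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_keypoints(
    start: List[Point], end: List[Point], num_frames: int
) -> List[List[Point]]:
    """Return `num_frames` in-between keypoint sets between `start` and `end`.

    Assumption: linear blending of matching keypoints. The source does not
    specify the interpolation scheme, so this is illustrative only.
    """
    if len(start) != len(end):
        raise ValueError("start and end must contain the same number of keypoints")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")

    frames = []
    for i in range(1, num_frames + 1):
        # Blend weight strictly between 0 (start frame) and 1 (end frame).
        t = i / (num_frames + 1)
        frames.append(
            [
                ((1 - t) * xs + t * xe, (1 - t) * ys + t * ye)
                for (xs, ys), (xe, ye) in zip(start, end)
            ]
        )
    return frames


# Two keypoints: the first moves across the cut, the second stays fixed.
before_cut = [(0.0, 0.0), (10.0, 20.0)]
after_cut = [(4.0, 8.0), (10.0, 20.0)]
transition = interpolate_keypoints(before_cut, after_cut, num_frames=3)
```

Each interpolated keypoint set would then be passed, together with the source frames, to the image translation network that synthesizes the pixels of the corresponding transition frame.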
