Jump Cut Smoothing for Talking Heads

Abstract

A jump cut introduces an abrupt, sometimes unwanted change in the viewing experience. We present a novel framework for smoothing jump cuts in talking head videos. We leverage the appearance of the subject from other source frames in the video, fusing it with a mid-level representation driven by DensePose keypoints and face landmarks. To achieve motion, we interpolate the keypoints and landmarks between the end frames around the cut. We then synthesize pixels from the keypoints and source frames using an image translation network. Because keypoints can contain errors, we propose a cross-modal attention scheme that selects the most appropriate source among multiple options for each keypoint. By leveraging this mid-level representation, our method outperforms a strong video interpolation baseline. We demonstrate our method on various jump cuts in talking head videos, such as cuts that remove filler words or pauses, and even random cuts. Our experiments show that we can achieve seamless transitions, even in challenging cases where the talking head rotates or moves drastically across the jump cut.
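The keypoint interpolation step can be illustrated with a minimal sketch. The abstract does not say which interpolation scheme is used, so the linear blend below is an assumption, and the function name `interpolate_keypoints`, the 2D point format, and the sample coordinates are hypothetical choices made only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def interpolate_keypoints(
    start: List[Point], end: List[Point], num_frames: int
) -> List[List[Point]]:
    """Linearly blend keypoints from `start` to `end`.

    Assumption: a plain linear blend; the paper may use a different scheme.
    Returns `num_frames` in-between frames, excluding both end frames.
    """
    if len(start) != len(end):
        raise ValueError("start and end must contain the same number of keypoints")
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")

    frames = []
    for i in range(1, num_frames + 1):
        # t lies strictly between 0 and 1, so the end frames are not repeated.
        t = i / (num_frames + 1)
        frames.append([
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(start, end)
        ])
    return frames


# Hypothetical keypoints on the frames just before and just after the cut.
before_cut = [(100.0, 120.0), (140.0, 118.0)]
after_cut = [(110.0, 130.0), (150.0, 126.0)]

for frame in interpolate_keypoints(before_cut, after_cut, num_frames=3):
    print(frame)
```

In the framework described above, each interpolated keypoint set would then be passed, together with the source frames, to the image translation network that synthesizes the corresponding in-between frame.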
