KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Abstract

Lip synchronization, the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, in addition to the usual issues in talking head generation (e.g., temporal consistency), lip synchronization presents significant new challenges, such as expression leakage from the input video and facial occlusions. These can severely impact real-world applications like automated dubbing, yet they are often neglected in existing work. To address these shortcomings, we present KeySync, a two-stage framework that solves the issue of temporal consistency while also incorporating solutions for leakage and occlusions through a carefully designed masking strategy. We show that KeySync achieves state-of-the-art results in lip reconstruction and cross-synchronization, improving visual quality and reducing expression leakage according to LipLeak, our novel leakage metric. Furthermore, we demonstrate the effectiveness of our new masking approach in handling occlusions and validate our architectural choices through several ablation studies. Code and model weights can be found at https://antonibigata.github.io/KeySync.
