SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection

Abstract

Recent advances in text-to-speech technology make it possible to generate high-fidelity synthetic speech that is nearly indistinguishable from real human voices. Although recent studies show that speech encoders based on self-supervised learning are effective for deepfake detection, these models struggle to generalize to unseen speakers. Our quantitative analysis suggests that the encoder representations are substantially influenced by speaker information, which causes detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply an orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts in the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
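
The subspace-and-projection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the speaker subspace is estimated, so the code below makes an assumption: it takes the top principal directions of the per-speaker mean embeddings as the subspace. The function names `estimate_speaker_subspace` and `null_speaker`, the `rank` parameter, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def estimate_speaker_subspace(features, speaker_ids, rank):
    """Return an orthonormal basis (dim x rank) for a speaker subspace.

    Assumption (not from the paper): the subspace is spanned by the top
    principal directions of the per-speaker mean embeddings.
    """
    features = np.asarray(features, dtype=np.float64)
    speaker_ids = np.asarray(speaker_ids)
    speakers = np.unique(speaker_ids)
    # One mean embedding per speaker.
    means = np.stack([features[speaker_ids == s].mean(axis=0) for s in speakers])
    centered = means - means.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions of between-speaker variation.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def null_speaker(features, basis):
    """Project features onto the orthogonal complement of the basis: x - U U^T x."""
    features = np.asarray(features, dtype=np.float64)
    return features - (features @ basis) @ basis.T


# Synthetic stand-in for encoder embeddings: 4 speakers, 50 utterances each.
rng = np.random.default_rng(0)
dim, n_speakers, per_speaker = 16, 4, 50
speaker_ids = np.repeat(np.arange(n_speakers), per_speaker)
offsets = 5.0 * rng.normal(size=(n_speakers, dim))
features = offsets[speaker_ids] + rng.normal(size=(n_speakers * per_speaker, dim))

# Four speaker means span at most three directions after centering.
basis = estimate_speaker_subspace(features, speaker_ids, rank=3)
residual = null_speaker(features, basis)
```

In this sketch, `residual` has no component along the estimated speaker directions, and a downstream detector would be trained on it instead of on the raw embeddings. The choice of `rank` trades off how much speaker information is removed against how much potentially useful signal is discarded.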
