

SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection

March 21, 2026
Authors: Kyudan Jung, Jihwan Kim, Minwoo Lee, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park
cs.AI

Abstract

Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
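The core operation described above, estimating a speaker subspace and projecting encoder features onto its orthogonal complement, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of SVD to estimate the subspace, and the rank parameter are assumptions for the sake of the example.

```python
import numpy as np

def estimate_speaker_subspace(speaker_embeddings, rank):
    """Estimate an orthonormal basis for the speaker subspace.

    speaker_embeddings: (n_samples, dim) array of speaker-related vectors.
    rank: assumed dimensionality of the speaker subspace.
    Returns a (dim, rank) matrix U with orthonormal columns.
    """
    # Center, then take the top right-singular vectors as the basis.
    centered = speaker_embeddings - speaker_embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T

def null_speaker_components(features, basis):
    """Project features onto the orthogonal complement of the speaker subspace.

    Computes x - U (U^T x) for each feature row x, suppressing
    speaker-dependent components and leaving the residual features.
    """
    return features - features @ basis @ basis.T
```

After projection, the residual features carry no component along the estimated speaker directions (their inner product with the basis is zero), so a downstream detector trained on them cannot exploit those directions.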