SNAP: Speaker Nulling for Artifact Projection in Speech Deepfake Detection
March 21, 2026
Authors: Kyudan Jung, Jihwan Kim, Minwoo Lee, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park
cs.AI
Abstract
Recent advances in text-to-speech technology enable the generation of high-fidelity synthetic speech that is nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised-learning-based speech encoders for deepfake detection, these models struggle to generalize to unseen speakers. Our quantitative analysis suggests that these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues; we call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework: we estimate a speaker subspace and apply an orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts in the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art detection performance.
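The core operation described above (estimating a speaker subspace and projecting it out) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the subspace rank, feature dimension, and the use of SVD over stand-in speaker embeddings are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 256, 16  # feature dimension and assumed speaker-subspace rank (illustrative)

# Stand-in for a matrix of speaker embeddings; in practice these would come
# from a speech encoder applied to many utterances of many speakers.
spk_embs = rng.normal(size=(1000, d))

# Estimate the speaker subspace as the top-k principal directions of the
# (centered) speaker embeddings via SVD.
_, _, Vt = np.linalg.svd(spk_embs - spk_embs.mean(axis=0), full_matrices=False)
B = Vt[:k].T  # (d, k) orthonormal basis of the estimated speaker subspace

def null_speaker(x):
    """Orthogonal projection onto the complement: x_perp = (I - B B^T) x."""
    return x - B @ (B.T @ x)

x = rng.normal(size=d)      # an encoder feature vector
x_perp = null_speaker(x)    # residual with speaker-subspace components removed

# The residual has (numerically) zero component inside the speaker subspace,
# so a downstream detector cannot rely on those directions.
print(np.abs(B.T @ x_perp).max())
```

Because `B` has orthonormal columns, `B @ (B.T @ x)` is exactly the component of `x` inside the estimated speaker subspace, so subtracting it leaves only the orthogonal residual in which synthesis artifacts are sought.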