Proactive Detection of Voice Cloning with Localized Watermarking

Abstract

In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss, which enables localized watermark detection down to the sample level, and a novel perceptual loss inspired by auditory masking, which improves imperceptibility. AudioSeal achieves state-of-the-art performance in both robustness to real-life audio manipulations and imperceptibility, as measured by automatic and human evaluation metrics. Additionally, AudioSeal is designed with a fast, single-pass detector that significantly surpasses existing models in speed, achieving detection up to two orders of magnitude faster and making it ideal for large-scale and real-time applications.
