ArtifactNet: AI-Generated Music Detection Based on Physical Residual Artifacts

Abstract

We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics: extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M parameters) extracts codec residuals from magnitude spectrograms; the residuals are decomposed via harmonic-percussive source separation (HPSS) into 7-channel forensic features and classified by a compact CNN (0.4M parameters; 4.0M total). We also introduce ArtifactBench, a multi-generator evaluation benchmark comprising 6,183 tracks (4,383 AI-generated tracks from 22 generators and 1,800 real tracks from 6 diverse sources), each tagged with bench_origin to enable fair zero-shot evaluation. On the unseen test partition (n = 2,263), ArtifactNet achieves F1 = 0.9829 with a false positive rate (FPR) of 1.49%, compared with CLAM (F1 = 0.7576, FPR = 69.26%) and SpecTTTra (F1 = 0.7713, FPR = 19.43%), both evaluated under identical conditions with published checkpoints. Codec-aware training (4-way WAV/MP3/AAC/Opus augmentation) further reduces cross-codec probability drift by 83% (Δ from 0.95 to 0.16), resolving the primary codec-invariance failure mode. These results establish forensic physics, the direct extraction of codec-level artifacts, as a more generalizable and parameter-efficient paradigm for AI music detection than representation learning, using 49x fewer parameters than CLAM and 4.8x fewer than SpecTTTra.
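
The abstract reports F1, false positive rate, and a relative reduction in cross-codec probability drift. The following minimal sketch shows how such quantities are conventionally computed. The function names and the confusion-matrix counts are hypothetical illustrations, not values from the paper; only the drift figures 0.95 and 0.16 are taken from the abstract.

```python
def f1_score(tp, fp, fn):
    """F1 for the positive (AI-generated) class; true negatives do not enter."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_positive_rate(fp, tn):
    """Fraction of real tracks that are wrongly flagged as AI-generated."""
    return fp / (fp + tn)


def relative_reduction(before, after):
    """Relative drop of a quantity, e.g. cross-codec probability drift."""
    return (before - after) / before


# Hypothetical counts, chosen only to exercise the formulas.
tp, fp, fn, tn = 90, 5, 10, 95
example_f1 = f1_score(tp, fp, fn)
example_fpr = false_positive_rate(fp, tn)

# Drift values quoted in the abstract: 0.95 before codec-aware training, 0.16 after.
drift_reduction = relative_reduction(0.95, 0.16)

print(f"F1 = {example_f1:.4f}, FPR = {example_fpr:.2%}")
print(f"drift reduction = {drift_reduction:.0%}")
```

Because F1 ignores true negatives, a detector can reach a moderate F1 on a test set dominated by AI-generated tracks while still flagging a large share of real tracks, which is why reporting FPR alongside F1 is informative.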
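
To make the HPSS step more concrete, here is a minimal pure-Python sketch of median-filtering HPSS with binary masks, applied to a toy magnitude spectrogram. The abstract does not say which HPSS variant ArtifactNet uses or how the 7 feature channels are assembled, so the median-filter approach, the filter width, and all names below are assumptions for illustration only.

```python
from statistics import median


def median_filter_rows(spec, width):
    """Median-filter every row with a sliding window (truncated at the edges)."""
    half = width // 2
    out = []
    for row in spec:
        n = len(row)
        out.append([median(row[max(0, t - half):min(n, t + half + 1)])
                    for t in range(n)])
    return out


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def hpss(spec, width=3):
    """Split spec[frequency][time] into harmonic and percussive parts.

    Filtering along time keeps sustained (harmonic) energy; filtering along
    frequency keeps broadband (percussive) energy. Each bin is assigned to
    whichever filtered estimate is larger; ties go to the harmonic part.
    """
    harm_est = median_filter_rows(spec, width)
    perc_est = transpose(median_filter_rows(transpose(spec), width))
    harmonic, percussive = [], []
    for f, row in enumerate(spec):
        h_row, p_row = [], []
        for t, value in enumerate(row):
            is_harmonic = harm_est[f][t] >= perc_est[f][t]
            h_row.append(value if is_harmonic else 0.0)
            p_row.append(0.0 if is_harmonic else value)
        harmonic.append(h_row)
        percussive.append(p_row)
    return harmonic, percussive


# Toy spectrogram: a sustained tone in frequency bin 2 and a click at frame 2.
toy = [[1.0 if (f == 2 or t == 2) else 0.0 for t in range(5)] for f in range(5)]
H, P = hpss(toy)

for label, part in (("harmonic", H), ("percussive", P)):
    print(label)
    for row in part:
        print(" ", row)
```

In this toy case the horizontal line (the tone) lands in the harmonic part and the vertical line (the click) lands in the percussive part, and the two parts sum back to the input. In practice an optimized library routine operating on the residual spectrogram would take the place of this loop-based version.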