D^3QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection

Abstract

The emergence of visual autoregressive (AR) models has revolutionized image generation while posing new challenges for synthetic image detection. Unlike earlier GAN- or diffusion-based methods, AR models generate images through discrete token prediction, which yields both markedly higher synthesis quality and distinctive characteristics in their vector-quantized representations. In this paper, we propose Discrete Distribution Discrepancy-aware Quantization Error (D^3QE) for detecting autoregressive-generated images. It exploits the distinctive patterns and the frequency-distribution bias of codebook usage in real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism and fuses semantic features with quantization-error latents. To evaluate our method, we construct a comprehensive dataset, ARForensics, covering 7 mainstream visual AR models. Experiments show that D^3QE achieves superior detection accuracy and strong generalization across different AR models, and that it is robust to real-world perturbations. Code is available at https://github.com/Zhangyr2022/D3QE.
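To make the two signals named in the abstract concrete, the following is a minimal NumPy sketch of a quantization error and a codebook frequency distribution under plain nearest-neighbor vector quantization. It is an illustration only, not the authors' implementation: the function names (`quantize`, `codebook_frequencies`, `frequency_discrepancy`), the Euclidean nearest-neighbor rule, and the use of total variation distance to compare frequency distributions are all assumptions introduced here.

```python
import numpy as np


def quantize(latents, codebook):
    """Nearest-neighbor vector quantization (assumed Euclidean metric).

    latents:  (N, D) array of continuous latent vectors.
    codebook: (K, D) array of code vectors.
    Returns (indices, quantized, error), where error = latents - quantized.
    """
    # Pairwise squared distances between every latent and every code: (N, K).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)          # index of the closest code per latent
    quantized = codebook[indices]        # (N, D) quantized latents
    error = latents - quantized          # (N, D) quantization error
    return indices, quantized, error


def codebook_frequencies(indices, codebook_size):
    """Normalized histogram of how often each code index is used."""
    counts = np.bincount(indices, minlength=codebook_size)
    return counts / counts.sum()


def frequency_discrepancy(freq_a, freq_b):
    """Total variation distance between two code-usage distributions.

    Hypothetical stand-in for a 'distribution discrepancy' measure.
    """
    return 0.5 * np.abs(freq_a - freq_b).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(16, 4))        # toy codebook: K=16, D=4
    real_latents = rng.normal(size=(256, 4))   # stand-in for real-image latents
    fake_latents = codebook[rng.integers(0, 4, size=256)]  # skewed code usage

    idx_real, _, err_real = quantize(real_latents, codebook)
    idx_fake, _, err_fake = quantize(fake_latents, codebook)

    gap = frequency_discrepancy(
        codebook_frequencies(idx_real, len(codebook)),
        codebook_frequencies(idx_fake, len(codebook)),
    )
    print("mean |error| real:", np.abs(err_real).mean())
    print("mean |error| fake:", np.abs(err_fake).mean())
    print("code-usage discrepancy:", gap)
```

In this toy setup the "fake" latents sit exactly on a few code vectors, so their quantization error is zero and their code usage is concentrated on a small subset of the codebook. A detector along the lines described above would learn from such differences rather than threshold them by hand.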