TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
March 31, 2025
Authors: Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
cs.AI
Abstract
The detection of telecom fraud faces significant challenges due to the lack
of high-quality multimodal training data that integrates audio signals with
reasoning-oriented textual analysis. To address this gap, we present
TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset
specifically designed for automated telecom fraud analysis. Our dataset is
constructed through three strategies: (1) Privacy-preserved text-truth sample
generation using automatic speech recognition (ASR)-transcribed call
recordings (with anonymized original audio), ensuring real-world consistency
through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via
large language model (LLM)-based self-instruction sampling on authentic ASR
outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that
simulates emerging fraud tactics through predefined communication scenarios and
fraud typologies. The generated dataset contains 28,511 rigorously processed
speech-text pairs, complete with detailed annotations for fraud reasoning. The
dataset is divided into three tasks: scenario classification, fraud detection,
and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a
standardized evaluation benchmark comprising proportionally sampled instances
from the dataset, to facilitate systematic testing of model performance on
telecom fraud detection tasks. We also contribute a production-optimized
supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while
open-sourcing the data processing framework to enable community-driven dataset
expansion. This work establishes a foundational framework for multimodal
anti-fraud research while addressing critical challenges in data privacy and
scenario diversity. The project will be released at
https://github.com/JimmyMa99/TeleAntiFraud.
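
The abstract says TeleAntiFraud-Bench is built from proportionally sampled instances but does not spell out the procedure. One plausible reading is stratified sampling by label, sketched below. This is an illustration under stated assumptions, not the authors' pipeline: the function name `proportional_sample`, the record fields, the label values, and the 10% fraction are all hypothetical.

```python
import random
from collections import defaultdict


def proportional_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each group defined by `key`.

    Hypothetical helper: the paper does not publish its sampling code here.
    """
    rng = random.Random(seed)

    # Group records by their label so each class is sampled separately.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for label in sorted(groups):
        items = groups[label]
        # Keep at least one instance per class so rare classes are not lost.
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    return sample


# Toy records with made-up labels; the real dataset has 28,511 speech-text pairs.
records = (
    [{"id": i, "fraud_type": "investment"} for i in range(60)]
    + [{"id": i, "fraud_type": "impersonation"} for i in range(60, 90)]
    + [{"id": i, "fraud_type": "none"} for i in range(90, 100)]
)

bench = proportional_sample(records, key="fraud_type", fraction=0.1)
```

Sampling within each class keeps the benchmark's label distribution close to that of the full dataset, which is the usual motivation for proportional sampling.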