TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

Abstract

Telecom fraud detection faces significant challenges because high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis is scarce. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) privacy-preserved text-truth sample generation using automatic speech recognition (ASR) transcripts of call recordings (with the original audio anonymized), with real-world consistency ensured through text-to-speech (TTS) model regeneration; (2) semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The resulting dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, and we open-source the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
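
The abstract describes TeleAntiFraud-Bench as built from proportionally sampled instances of the dataset. As an illustration only, the following is a minimal sketch of what proportional (stratified) sampling could look like. The record field `fraud_type`, the label values, the sampling fraction, and the function name `proportional_sample` are all hypothetical and are not taken from the project's released code.

```python
import random
from collections import defaultdict


def proportional_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Group records by their label so each class is sampled separately.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for label in sorted(groups):
        items = groups[label]
        # Keep at least one item per group so rare classes stay represented.
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    return sample


# Toy data: the labels are illustrative, not the dataset's real taxonomy.
records = (
    [{"id": i, "fraud_type": "none"} for i in range(80)]
    + [{"id": 80 + i, "fraud_type": "phishing"} for i in range(20)]
)
bench = proportional_sample(records, key="fraud_type", fraction=0.1)
```

Sampling within each group, rather than over the whole pool, keeps the class ratio of the subset close to that of the full set, which is the property a proportionally sampled benchmark is presumably after.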

