ELTEX: A Framework for Domain-Driven Synthetic Data Generation
March 19, 2025
Authors: Arina Razmyslovich, Kseniia Murasheva, Sofia Sedlova, Julien Capitaine, Eugene Dmitriev
cs.AI
Abstract
We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework
for generating high-quality synthetic training data in specialized domains.
While Large Language Models (LLMs) have shown impressive general capabilities,
their performance in specialized domains like cybersecurity remains limited by
the scarcity of domain-specific training data. ELTEX addresses this challenge
by systematically integrating explicit domain indicator extraction with dynamic
prompting to preserve critical domain knowledge throughout the generation
process. We demonstrate ELTEX's effectiveness in the context of
blockchain-related cyberattack detection, where we fine-tune Gemma-2B using
various combinations of real and ELTEX-generated data. Our results show that
the ELTEX-enhanced model achieves performance competitive with GPT-4 across
both standard classification metrics and uncertainty calibration, while
requiring significantly fewer computational resources. We release a curated
synthetic dataset of social media texts for cyberattack detection in
blockchain. Our work demonstrates that domain-driven synthetic data generation
can effectively bridge the performance gap between resource-efficient models
and larger architectures in specialized domains.