ELTEX: A Framework for Domain-Driven Synthetic Data Generation
March 19, 2025
Authors: Arina Razmyslovich, Kseniia Murasheva, Sofia Sedlova, Julien Capitaine, Eugene Dmitriev
cs.AI
Abstract
We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework
for generating high-quality synthetic training data in specialized domains.
While Large Language Models (LLMs) have shown impressive general capabilities,
their performance in specialized domains like cybersecurity remains limited by
the scarcity of domain-specific training data. ELTEX addresses this challenge
by systematically integrating explicit domain indicator extraction with dynamic
prompting to preserve critical domain knowledge throughout the generation
process. We demonstrate ELTEX's effectiveness in the context of
blockchain-related cyberattack detection, where we fine-tune Gemma-2B using
various combinations of real and ELTEX-generated data. Our results show that
the ELTEX-enhanced model achieves performance competitive with GPT-4 across
both standard classification metrics and uncertainty calibration, while
requiring significantly fewer computational resources. We release a curated
synthetic dataset of social media texts for cyberattack detection in
blockchain. Our work demonstrates that domain-driven synthetic data generation
can effectively bridge the performance gap between resource-efficient models
and larger architectures in specialized domains.
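The abstract describes combining explicit domain indicator extraction with dynamic prompting but gives no implementation details. The following is only a minimal sketch of that general idea, not the authors' method: it assumes that "indicators" can be approximated by frequent non-stopword tokens in a few seed texts, and that "dynamic prompting" means sampling a different subset of indicators into each generation prompt. The function names `extract_indicators` and `build_prompt`, the stopword list, the prompt wording, and the seed texts are all hypothetical and chosen for illustration.

```python
import random
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "on",
             "was", "is", "for", "from", "by", "via"}


def extract_indicators(texts, top_k=5):
    """Return the top_k most frequent non-stopword tokens in the seed texts.

    Assumption: token frequency stands in for a real indicator extractor.
    """
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return [token for token, _ in counts.most_common(top_k)]


def build_prompt(indicators, n_terms=3, seed=None):
    """Build a generation prompt around a random subset of the indicators.

    Sampling a fresh subset per call is the (assumed) "dynamic" part.
    """
    rng = random.Random(seed)
    chosen = rng.sample(indicators, k=min(n_terms, len(indicators)))
    return ("Write a short social media post about a blockchain-related "
            "cyberattack. Use these domain terms: " + ", ".join(chosen) + ".")


# Made-up seed texts, not taken from the released dataset.
seed_texts = [
    "Bridge exploit drained funds from the protocol wallet",
    "Phishing site drained wallet funds after a fake airdrop",
    "Flash loan exploit hit the lending protocol",
]
indicators = extract_indicators(seed_texts)
prompt = build_prompt(indicators, seed=0)
print(prompt)
```

In this sketch the prompt would then be sent to a generator model, and the returned texts mixed with real examples for fine-tuning. Varying `seed` (or omitting it) changes which indicators appear in each prompt, which is one simple way to keep domain terms present while diversifying the generated samples.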