ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Abstract

We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX's effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.
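The abstract reports results on uncertainty calibration but does not name the metric. As an illustration only, the sketch below computes expected calibration error (ECE), one common way to quantify how closely a classifier's confidence matches its accuracy. The function name, the equal-width binning, and the default of 10 bins are assumptions made for this example and are not taken from the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Expected calibration error (ECE) with equal-width confidence bins.

    Assumes every confidence lies in [0, 1]. This is an illustrative
    metric choice, not necessarily the one used in the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Per-bin accumulators: [sum of confidence, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map confidence to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue
        avg_conf = conf_sum / count
        accuracy = n_correct / count
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count / total) * abs(accuracy - avg_conf)
    return ece
```

A lower value indicates better calibration: a model that predicts with confidence 0.8 and is right half the time contributes a gap of 0.3 for that bin, while a model whose confidence matches its accuracy in every bin scores 0.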