Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Abstract

Large language models (LLMs) have made remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for a target domain remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm: they rely heavily on explicit domain descriptions written in natural language and on careful prompt engineering, which limits their applicability in real-world scenarios where a domain is difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, in which the target domain is defined only by a set of reference examples. This setting matters most when domain characteristics are difficult to express in natural language. We propose DOMINO, a novel framework that learns a minimal sufficient domain representation from reference samples and uses it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective that separates domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.