Domain-Specific Data Synthesis for Large Language Models via Minimal Sufficient Representation Learning

Abstract

Large language models (LLMs) have made remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for a target domain remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm: they rely heavily on explicit domain descriptions expressed in natural language and on careful prompt engineering, which limits their applicability in real-world scenarios where a domain is difficult to describe or formally specify. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, in which the target domain is defined only by a set of reference examples, a setting that suits domains whose characteristics are hard to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from the reference samples and uses it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural-language domain specifications.