Domain-Specific Data Synthesis for Large Language Models via Minimal Sufficient Representation Learning

Abstract

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.