Domain-Specific Data Synthesis for Large Language Models via Minimal Sufficient Representation Learning

Abstract

Large language models (LLMs) have made remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains when fine-tuned on domain-specific data. However, acquiring high-quality data for a target domain remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm: they rely heavily on explicit natural-language domain descriptions and careful prompt engineering, which limits their applicability in real-world scenarios where a domain is difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, in which the target domain is defined only by a set of reference examples; this setting matters most when domain characteristics are hard to express in natural language. We propose DOMINO, a novel framework that learns a minimal sufficient domain representation from the reference samples and uses it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective that separates domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural-language domain specifications.
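The abstract reports its results as Pass@1 accuracy but does not say how the metric is computed. As a point of reference only, the sketch below shows the unbiased pass@k estimator commonly used for code generation benchmarks, which reduces to the fraction of passing samples when k = 1. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts in the usage note are illustrative assumptions, not details taken from this work.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: evaluation budget (k = 1 gives Pass@1)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail), drawing without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs (hypothetical format)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(10, 3), (10, 5)], k=1)` averages the per-problem pass rates 3/10 and 5/10. Whether the evaluation behind the reported numbers uses this estimator or single greedy decoding per problem is not stated in the abstract.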