SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation
September 12, 2025
Authors: Iman Barati, Mostafa Amiri, Heshaam Faili
cs.AI
Abstract
Supervised Fine-Tuning (SFT) is essential for training large language models
(LLMs), significantly enhancing critical capabilities such as instruction
following and in-context learning. Nevertheless, creating suitable training
datasets tailored for specific domains remains challenging due to unique domain
constraints and data scarcity. In this paper, we propose SearchInstruct, an
innovative method explicitly designed to construct high-quality instruction
datasets for SFT. Our approach begins with a limited set of domain-specific,
human-generated questions, which are systematically expanded using a large
language model. Subsequently, domain-relevant resources are dynamically
retrieved to generate accurate and contextually appropriate answers for each
augmented question. Experimental evaluation demonstrates that SearchInstruct
enhances both the diversity and quality of SFT datasets, leading to measurable
improvements in LLM performance within specialized domains. Additionally, we
show that beyond dataset generation, the proposed method can also effectively
facilitate tasks such as model editing, enabling efficient updates to existing
models. To support reproducibility and community adoption, we provide full
implementation details, the complete set of generated instruction-response
pairs, and the source code in a publicly accessible Git repository:
[https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)
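
The pipeline described in the abstract (seed questions, LLM-based expansion, retrieval, grounded answer generation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_dataset`, `InstructionPair`, and the `toy_*` stand-ins are hypothetical names, and a real system would plug an LLM call and a search or retrieval backend into the `expand`, `retrieve`, and `answer` callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionPair:
    instruction: str      # the (possibly expanded) question
    response: str         # answer generated from the retrieved documents
    sources: List[str]    # documents the answer was grounded on


def build_dataset(
    seed_questions: List[str],
    expand: Callable[[str], List[str]],         # hypothetical: LLM-based question expansion
    retrieve: Callable[[str], List[str]],       # hypothetical: domain resource retrieval
    answer: Callable[[str, List[str]], str],    # hypothetical: grounded answer generation
    top_k: int = 3,
) -> List[InstructionPair]:
    pairs: List[InstructionPair] = []
    seen = set()
    for seed in seed_questions:
        # Keep the human-written seed and add its expanded variants.
        for question in [seed, *expand(seed)]:
            key = question.strip().lower()
            if key in seen:
                continue  # drop duplicate questions
            seen.add(key)
            docs = retrieve(question)[:top_k]
            if not docs:
                continue  # skip questions with no supporting documents
            pairs.append(InstructionPair(question, answer(question, docs), docs))
    return pairs


# Toy stand-ins so the sketch runs without any external service.
CORPUS = [
    "Saffron is a spice harvested from crocus flowers.",
    "Basil needs warm soil.",
]


def _keywords(text: str) -> set:
    # Lowercased words longer than three characters, punctuation stripped.
    return {w.strip("?.,").lower() for w in text.split() if len(w.strip("?.,")) > 3}


def toy_expand(question: str) -> List[str]:
    return [question.replace("What is", "Why use")]


def toy_retrieve(question: str) -> List[str]:
    return [doc for doc in CORPUS if _keywords(question) & _keywords(doc)]


def toy_answer(question: str, docs: List[str]) -> str:
    return "Based on the retrieved text: " + docs[0]
```

Calling `build_dataset(["What is saffron?"], toy_expand, toy_retrieve, toy_answer)` yields one pair for the seed question and one for its expanded variant, each carrying the documents it was grounded on. Passing the three stages as callables keeps the loop independent of any particular model or search provider.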