SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation

Abstract

Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs) and significantly improves critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets for specific domains remains challenging because of domain-specific constraints and data scarcity. In this paper, we propose SearchInstruct, a method explicitly designed to construct high-quality instruction datasets for SFT. Our approach begins with a small set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Domain-relevant resources are then dynamically retrieved to generate an accurate and contextually appropriate answer for each augmented question. Experimental evaluation shows that SearchInstruct improves both the diversity and the quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. We also show that, beyond dataset generation, the proposed method can effectively support tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)
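
The three stages named in the abstract (seed questions, LLM-based expansion, retrieval-grounded answering) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the repository linked above. All names here (`InstructionPair`, `build_dataset`, and the `toy_*` stand-ins) are hypothetical, the LLM and the retriever are replaced by plain callables, and two choices are assumptions of this sketch rather than claims from the paper: seed questions are kept alongside their expansions, and duplicates are dropped by case-insensitive match.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionPair:
    """One SFT example: a question, its answer, and the retrieved sources."""
    instruction: str
    response: str
    sources: List[str]


def build_dataset(
    seed_questions: List[str],
    expand: Callable[[str], List[str]],        # stand-in for LLM question expansion
    retrieve: Callable[[str], List[str]],      # stand-in for domain resource retrieval
    answer: Callable[[str, List[str]], str],   # stand-in for grounded answer generation
) -> List[InstructionPair]:
    pairs: List[InstructionPair] = []
    seen = set()
    for seed in seed_questions:
        # Assumption: keep the seed itself in addition to its expansions.
        for question in [seed, *expand(seed)]:
            key = question.strip().lower()
            if key in seen:  # assumption: simple case-insensitive deduplication
                continue
            seen.add(key)
            docs = retrieve(question)
            pairs.append(InstructionPair(question, answer(question, docs), docs))
    return pairs


# Toy stand-ins so the sketch runs without any model or search backend.
def toy_expand(question: str) -> List[str]:
    return [f"{question} (rephrased)", f"{question} (follow-up)"]


def toy_retrieve(question: str) -> List[str]:
    return [f"doc about: {question}"]


def toy_answer(question: str, docs: List[str]) -> str:
    return f"Answer to '{question}' using {len(docs)} source(s)."


if __name__ == "__main__":
    for pair in build_dataset(["What is X?"], toy_expand, toy_retrieve, toy_answer):
        print(pair.instruction, "->", pair.response)
```

In a real setting, `expand` and `answer` would wrap calls to a language model and `retrieve` would query a search engine or document index. Passing them in as callables keeps the pipeline logic independent of any particular provider.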