Instruction Mining: High-Quality Instruction Data Selection for Large Language Models

Abstract

Large language models typically undergo two training stages: pretraining and finetuning. Although large-scale pretraining endows a model with a strong ability to generate natural language responses, pretrained models can still fail to understand human instructions at times. To improve language models' ability to interpret and respond to instructions, instruction finetuning has emerged as a critical method in this area. Recent studies have found that large language models can be finetuned to perform well even with a small amount of high-quality instruction-following data. However, the selection of high-quality datasets for finetuning language models still lacks clear guidelines. In this paper, we propose InstructMining, a linear rule for evaluating the quality of instruction-following data. We formulate InstructMining using specific natural language indicators. To investigate the relationship between data quality and these indicators, we conduct extensive finetuning experiments. The experimental results are then used to estimate the parameters of InstructMining. To further investigate its performance, we use InstructMining to select high-quality data from unseen datasets. Results demonstrate that InstructMining can help select relatively high-quality samples from various instruction-following datasets. Compared to models finetuned on unfiltered datasets, models finetuned on InstructMining-selected datasets perform better in 42.5% of cases.
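
The idea of a linear rule over natural language indicators can be illustrated with a minimal sketch. It is not the paper's implementation: the indicator names, the weights, the bias, and the convention that a higher score means higher quality are all assumptions made for illustration. In the approach described above, the weights would be estimated from finetuning experiments rather than set by hand.

```python
from typing import Dict, List

# Hypothetical indicator weights (assumption). In the described approach these
# parameters would be estimated from finetuning experiments, not hand-picked.
WEIGHTS: Dict[str, float] = {
    "response_length": 0.2,
    "coherence": 0.5,
    "naturalness": 0.3,
}
BIAS = 0.0  # hypothetical intercept


def quality_score(indicators: Dict[str, float]) -> float:
    """Linear rule: bias plus the weighted sum of indicator values."""
    return BIAS + sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)


def select_top_k(examples: List[Dict[str, float]], k: int) -> List[int]:
    """Return the indices of the k highest-scoring examples, best first."""
    ranked = sorted(
        range(len(examples)),
        key=lambda i: quality_score(examples[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy indicator values for three instruction-response pairs (made up).
examples = [
    {"response_length": 0.1, "coherence": 0.9, "naturalness": 0.8},
    {"response_length": 0.9, "coherence": 0.2, "naturalness": 0.1},
    {"response_length": 0.5, "coherence": 0.6, "naturalness": 0.7},
]
selected = select_top_k(examples, k=2)
print(selected)
```

With these toy values the first and third examples score highest, so they would be the ones kept for finetuning. If the rule were instead fitted to predict a loss, lower scores would indicate better data and the sort order would be reversed.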
