Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Abstract

Instruction tuning plays a critical role in aligning large language models (LLMs) with human preferences. Despite the vast number of open instruction datasets, naively training an LLM on all existing instructions may be neither optimal nor practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, in the context of instruction tuning, there is still a gap in knowledge about which data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review of the existing literature on data assessment and selection, specifically for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones, structured under a unified, fine-grained taxonomy. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, we compare the latest methods using their officially reported results and provide an in-depth discussion of their limitations. Finally, we summarize the open challenges and propose promising avenues for future studies. All related content is available at https://github.com/yuleiqin/fantastic-data-engineering.
