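Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models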

Abstract

Instruction tuning plays a critical role in aligning large language models (LLMs) with human preferences. Despite the vast number of open instruction datasets, naively training an LLM on all existing instructions may be neither optimal nor practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, in the context of instruction tuning, there is still a knowledge gap regarding which data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review of the existing literature on data assessment and selection, specifically for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones, structured under a unified, fine-grained taxonomy. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, we compare the latest methods on their officially reported results to provide in-depth discussions of their limitations. Finally, we summarize the open challenges and propose promising avenues for future studies. All related content is available at https://github.com/yuleiqin/fantastic-data-engineering.
