Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

August 4, 2024
Authors: Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun
cs.AI

Abstract

Instruction tuning plays a critical role in aligning large language models (LLMs) with human preferences. Despite the vast number of open instruction datasets, naively training an LLM on all existing instructions may be neither optimal nor practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, in the context of instruction tuning, there is still a gap in knowledge about which data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review of the existing literature on data assessment and selection, especially for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones, organized under a unified, fine-grained taxonomy. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, the latest methods are compared on their officially reported results to provide in-depth discussion of their limitations. Finally, we summarize the open challenges and propose promising avenues for future studies. All related content is available at https://github.com/yuleiqin/fantastic-data-engineering.
