ICon: In-Context Contribution for Automatic Data Selection
May 8, 2025
Authors: Yixin Yang, Qingxiu Dong, Linli Yao, Fangwei Zhu, Zhifang Sui
cs.AI
Abstract
Data selection for instruction tuning is essential for improving the
performance of Large Language Models (LLMs) and reducing training cost.
However, existing automated selection methods either depend on computationally
expensive gradient-based measures or manually designed heuristics, which may
fail to fully exploit the intrinsic attributes of data. In this paper, we
propose In-context Learning for Contribution Measurement (ICon), a novel
gradient-free method that takes advantage of the implicit fine-tuning nature of
in-context learning (ICL) to measure sample contribution without gradient
computation or manual indicator engineering. ICon offers a computationally
efficient alternative to gradient-based methods and reduces human inductive
bias inherent in heuristic-based approaches. ICon comprises three components
and identifies high-contribution data by assessing performance shifts under
implicit learning through ICL. Extensive experiments on three LLMs across 12
benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of
ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data
outperform those trained on the full dataset by 5.42 percentage points and
exceed the best performance of widely used selection methods by 2.06
percentage points. We further analyze the high-contribution samples selected
by ICon and find that they span diverse tasks and exhibit appropriate
difficulty levels, rather than simply being the hardest examples.
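
To make the core idea concrete, below is a minimal, hypothetical sketch of scoring a candidate training sample by the performance shift its in-context inclusion induces on a small probe set, then keeping the top-scoring fraction of the pool. This is not the authors' three-component ICon implementation; the prompt template, the probe set, the model identifier, and the exact contribution definition are illustrative assumptions only.

```python
# Hypothetical sketch of ICL-based contribution scoring (not the official ICon code).
# Assumptions: instruction/response dicts, a small held-out probe set, and a simple
# "Instruction:/Response:" prompt template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # one of the base models mentioned in the abstract
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def response_loss(prompt: str, response: str) -> float:
    """Average token loss of `response` conditioned on `prompt` (prompt tokens masked out)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore loss on the prompt portion
    with torch.no_grad():
        out = model(full_ids.to(model.device), labels=labels.to(model.device))
    return out.loss.item()

def contribution_score(candidate: dict, probes: list[dict]) -> float:
    """Proxy for sample contribution: average loss reduction on probe examples
    when the candidate is prepended as an in-context demonstration."""
    demo = f"Instruction: {candidate['instruction']}\nResponse: {candidate['response']}\n\n"
    shift = 0.0
    for p in probes:
        prompt = f"Instruction: {p['instruction']}\nResponse: "
        shift += response_loss(prompt, p["response"]) - response_loss(demo + prompt, p["response"])
    return shift / len(probes)

# Rank the candidate pool and keep the top 15%, mirroring the budget in the abstract.
# pool, probes = load_candidates(), load_probes()   # hypothetical loaders
# ranked = sorted(pool, key=lambda c: contribution_score(c, probes), reverse=True)
# selected = ranked[: int(0.15 * len(ranked))]
```

The sketch is gradient-free in the sense the abstract describes: contribution is read off from forward-pass loss shifts under in-context demonstrations rather than from gradients or hand-designed heuristics.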