A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)
February 16, 2026
Authors: Nihal V. Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis
cs.AI
Abstract
Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient-based representations paired with a greedy round-robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at https://github.com/dcml-lab/targeted-instruction-selection.
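To make the "greedy round-robin selection" concrete, the sketch below is a minimal, illustrative Python version, not the authors' released implementation (see the linked repository for that). It assumes each candidate and query example has already been mapped to a feature vector (e.g., a projected per-example gradient); the function name, cosine-similarity scoring, and array shapes are assumptions made for illustration. Each query example, in turn, claims its most similar not-yet-selected candidate until the selection budget is reached.

```python
import numpy as np


def greedy_round_robin_select(pool_feats: np.ndarray,
                              query_feats: np.ndarray,
                              budget: int) -> list[int]:
    """Cycle over query examples; each query in turn claims its most
    similar not-yet-selected candidate until the budget is exhausted."""
    # Cosine similarity between every candidate and every query example.
    pool_norm = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    query_norm = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = pool_norm @ query_norm.T  # shape: (n_pool, n_query)

    selected: list[int] = []
    available = np.ones(len(pool_feats), dtype=bool)
    q = 0  # index of the query example whose turn it is
    while len(selected) < budget and available.any():
        scores = np.where(available, sim[:, q], -np.inf)
        best = int(np.argmax(scores))
        selected.append(best)
        available[best] = False
        q = (q + 1) % query_feats.shape[0]
    return selected


# Toy usage with random features standing in for gradient representations.
rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 64))
query = rng.normal(size=(8, 64))
subset = greedy_round_robin_select(pool, query, budget=100)
print(len(subset), "examples selected")
```

Read this way, each pick greedily increases the selected subset's similarity to the query set, which is consistent with the abstract's framing of selection algorithms as forms of approximate distance minimization between the selected subset and the query set.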