
A Critical Look at Targeted Instruction Selection: Disentangling What Matters (and What Doesn't)

February 16, 2026
作者: Nihal V. Nayak, Paula Rodriguez-Diaz, Neha Hulkund, Sara Beery, David Alvarez-Melis
cs.AI

Abstract

Instruction fine-tuning of large language models (LLMs) often involves selecting a subset of instruction training data from a large candidate pool, using a small query set from the target task. Despite growing interest, the literature on targeted instruction selection remains fragmented and opaque: methods vary widely in selection budgets, often omit zero-shot baselines, and frequently entangle the contributions of key components. As a result, practitioners lack actionable guidance on selecting instructions for their target tasks. In this work, we aim to bring clarity to this landscape by disentangling and systematically analyzing the two core ingredients: data representation and selection algorithms. Our framework enables controlled comparisons across models, tasks, and budgets. We find that only gradient-based data representations choose subsets whose similarity to the query consistently predicts performance across datasets and models. While no single method dominates, gradient-based representations paired with a greedy round-robin selection algorithm tend to perform best on average at low budgets, but these benefits diminish at larger budgets. Finally, we unify several existing selection algorithms as forms of approximate distance minimization between the selected subset and the query set, and support this view with new generalization bounds. More broadly, our findings provide critical insights and a foundation for more principled data selection in LLM fine-tuning. The code is available at https://github.com/dcml-lab/targeted-instruction-selection.
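The greedy round-robin pairing of a representation with a selection algorithm described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation (see the linked repository for that): the function name `greedy_round_robin_select` is hypothetical, and the feature matrices stand in for whatever per-example representation is used (e.g., gradient features), with cosine similarity as the query-candidate affinity.

```python
import numpy as np

def greedy_round_robin_select(candidate_feats, query_feats, budget):
    """Toy round-robin selection: cycle through query examples, letting each
    query in turn greedily claim its most similar unselected candidate,
    until the selection budget is exhausted."""
    # Row-normalize so dot products equal cosine similarities.
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = q @ c.T  # shape: (n_query, n_candidates)

    selected, remaining = [], set(range(c.shape[0]))
    turn = 0
    while len(selected) < budget and remaining:
        row = sims[turn % q.shape[0]]            # this round's query
        best = max(remaining, key=lambda j: row[j])  # greedy pick for it
        selected.append(best)
        remaining.remove(best)
        turn += 1
    return selected
```

Because each query claims candidates in turn, the selected subset stays close to every part of the query set rather than collapsing onto a single dominant query, which is consistent with the distance-minimization view of selection sketched in the abstract.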
PDF · February 18, 2026