AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
May 12, 2025
Authors: Kai Hua, Steven Wu, Ge Zhang, Ke Shen
cs.AI
Abstract
Recently, there has been growing interest in collecting reasoning-intensive
pretraining data to improve LLMs' complex reasoning ability. Prior approaches
typically rely on supervised classifiers to identify such data, which requires
labeling by humans or LLMs and often introduces domain-specific biases. Given
the crucial role of attention heads in in-context reasoning, we propose
AttentionInfluence, a simple yet effective, training-free method that requires
no supervision signal. Our approach enables a small pretrained language model to
act as a strong data selector through a simple attention head masking
operation. Specifically, we identify retrieval heads and compute the loss
difference when masking these heads. We apply AttentionInfluence to a
1.3B-parameter dense model to conduct data selection on the SmolLM corpus of
241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B
tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD
learning rate scheduling. Our experimental results demonstrate substantial
improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive
and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and
HumanEval). This demonstrates an effective weak-to-strong scaling property, in
which small models improve the final performance of larger models, offering a
promising and scalable path for reasoning-centric data selection.
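To make the scoring step concrete, below is a minimal sketch of the loss-difference computation the abstract describes: score a candidate document by how much the language-modeling loss increases when the identified retrieval heads are masked. It uses GPT-2 as a small stand-in model (whose `forward` accepts a `head_mask` argument), and the retrieval-head indices are hypothetical placeholders; the paper's actual retrieval-head detection procedure, its 1.3B selector model, and any normalization of the score are not shown here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical (layer, head) pairs standing in for the retrieval heads the
# paper identifies; a real run would detect these in a separate pass.
RETRIEVAL_HEADS = [(3, 5), (7, 1), (10, 9)]

def attention_influence_score(model, tokenizer, text, retrieval_heads, device="cpu"):
    """Score `text` by the loss increase caused by masking retrieval heads.

    A larger score means the sample relies more on these heads, which the
    paper treats as a proxy for reasoning-intensive content.
    """
    enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    labels = enc["input_ids"]

    with torch.no_grad():
        # Reference loss with all attention heads active.
        base_loss = model(**enc, labels=labels).loss

        # Build a [num_layers, num_heads] mask: 1.0 keeps a head, 0.0 masks it.
        head_mask = torch.ones(model.config.n_layer, model.config.n_head, device=device)
        for layer, head in retrieval_heads:
            head_mask[layer, head] = 0.0
        masked_loss = model(**enc, labels=labels, head_mask=head_mask).loss

    # AttentionInfluence: the loss difference under masking.
    return (masked_loss - base_loss).item()

if __name__ == "__main__":
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
    s = attention_influence_score(lm, tok, "If x + 3 = 7, then x = 4.", RETRIEVAL_HEADS)
    print(f"AttentionInfluence score: {s:.4f}")
```

In a selection pipeline, documents would be ranked by this score and the top-scoring subset mixed back into the pretraining corpus, as the paper does with the 73B-token subset of SmolLM.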