AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

Abstract

Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve the complex reasoning ability of large language models (LLMs). Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs and often introduces domain-specific biases. Because attention heads are crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method that requires no supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, in which small models improve the final performance of larger models, offering a promising and scalable path for reasoning-centric data selection.
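
The selection step described in the abstract (score each sample by the loss difference under retrieval-head masking, then keep the highest-scoring samples) can be illustrated with a minimal sketch. It assumes per-sample losses from the unmasked and masked model have already been computed. The function names `influence_scores` and `select_top_fraction`, the use of a plain difference rather than a normalized one, the kept fraction, and the toy loss values are all hypothetical choices made for illustration and are not taken from the abstract.

```python
from typing import List, Sequence


def influence_scores(base_losses: Sequence[float],
                     masked_losses: Sequence[float]) -> List[float]:
    """Per-sample loss increase caused by masking the retrieval heads."""
    if len(base_losses) != len(masked_losses):
        raise ValueError("loss lists must have the same length")
    # Assumption: plain difference (masked - base). A relative or otherwise
    # normalized variant would be a small change on this line.
    return [m - b for b, m in zip(base_losses, masked_losses)]


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring samples, in original order."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, round(len(scores) * fraction))
    # Rank sample indices by score, largest loss increase first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy losses for five documents (made-up numbers, not results from the paper).
base = [2.0, 2.5, 1.8, 3.0, 2.2]    # small model, all heads active
masked = [2.1, 3.4, 1.9, 3.1, 3.0]  # same model, retrieval heads masked

scores = influence_scores(base, masked)
selected = select_top_fraction(scores, fraction=0.4)
print(selected)
```

Samples whose loss rises most when the retrieval heads are masked are the ones that depend most on those heads, which is the intuition behind treating the loss difference as a selection signal. Producing the masked losses requires running the small model with the identified heads disabled, which this sketch leaves out.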
