Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
April 17, 2026
Authors: Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou, Bingni Zhang, Weiguo Feng, Taifeng Wang, Cihang Xie, Fengze Liu
cs.AI
Abstract
Everyday tasks come with a target, and pretraining a model around that target is what turns it into a domain expert. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for selecting target-oriented pretraining data. Rather than relying on black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLM. Concretely, we quantify neuron impact, gather the most influential neurons across layers into a compact Neuron-Activated Graph (NAG), and rank candidate data by their NAG similarity to target examples. Experiments on six benchmarks show that NAG-based Ranking improves target-oriented pretraining by 4.9% on average over random sampling and outperforms the state-of-the-art baseline on HellaSwag by 5.3% accuracy. The method also remains effective in the more practical multi-target setting, where our best setup surpasses two baselines by 1.1% and 4.1%, respectively. Furthermore, we provide a comprehensive analysis of why and how NAG works: deactivating the NAG-selected neurons (only 0.12% of all neurons) causes a 23.5% performance collapse, and restricting the NAG to the final layer incurs a 4.1% average drop, indicating that the NAG captures a sparse "functional backbone" of the learned target features. We release the code at https://github.com/asillycat/NAG.
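The pipeline described above — score neuron impact, keep the most influential neurons per layer as a sparse graph, and rank candidate data by graph similarity to target examples — can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: the impact measure (activation magnitude), the per-layer `top_k` budget, and the Jaccard similarity used here are hypothetical stand-ins for the paper's actual definitions.

```python
# Illustrative sketch of NAG-style data ranking (assumptions: activations are
# given as {layer: [per-neuron impact scores]}; impact = |activation|;
# similarity = Jaccard overlap of the two neuron sets).

def build_nag(activations, top_k=8):
    """Keep the top_k highest-impact neurons per layer as a sparse set of (layer, idx)."""
    nag = set()
    for layer, impacts in activations.items():
        ranked = sorted(range(len(impacts)), key=lambda i: abs(impacts[i]), reverse=True)
        nag.update((layer, i) for i in ranked[:top_k])
    return nag

def nag_similarity(nag_a, nag_b):
    """Jaccard overlap between two neuron-activated graphs."""
    if not nag_a and not nag_b:
        return 0.0
    return len(nag_a & nag_b) / len(nag_a | nag_b)

def rank_candidates(target_nags, candidate_nags):
    """Score each candidate by its mean NAG similarity to the target examples,
    returning (score, candidate_index) pairs sorted best-first."""
    scores = []
    for idx, cand in enumerate(candidate_nags):
        score = sum(nag_similarity(cand, t) for t in target_nags) / len(target_nags)
        scores.append((score, idx))
    return sorted(scores, reverse=True)
```

In practice the activation dictionaries would come from forward hooks on an off-the-shelf LLM; candidates whose sparse high-impact neuron sets overlap most with the target's graph would be selected for pretraining.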