
Shaping capabilities with token-level data filtering

January 29, 2026
Authors: Neil Rathi, Alec Radford
cs.AI

Abstract

Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.
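
The abstract contrasts document-level filtering with token-level filtering of pretraining data. As a rough illustration of how a token-level filter can be applied at training time, the sketch below masks flagged tokens out of the next-token-prediction loss. This is a minimal sketch, not the paper's implementation: the names `masked_next_token_loss` and `forget_token_mask` are hypothetical, and the authors may realize filtering differently (for example, by removing tokens from the data stream rather than from the loss).

```python
# Minimal sketch (assumption): token-level filtering realized as a loss mask
# during next-token-prediction pretraining. `forget_token_mask` stands in for
# the output of a token classifier; the SAE-based labeling and distillation
# pipeline described in the abstract is not shown here.
import torch
import torch.nn.functional as F


def masked_next_token_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           forget_token_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over next-token targets, excluding forget-domain tokens.

    logits: (batch, seq, vocab) model outputs
    targets: (batch, seq) next-token ids
    forget_token_mask: (batch, seq) bool, True where a token was labeled as
        belonging to the filtered (e.g., medical) domain
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    keep = (~forget_token_mask).float()
    # Average the loss only over tokens that survive filtering.
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```

In this reading, the mask would be produced by the cheap distilled token classifier the abstract mentions, so the extra cost at training time is a single mask lookup per token.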