STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Abstract

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but also relies on local approximations. In this work, we propose a shift in perspective: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover the influence of individual training examples via sparse linear decomposition. STRIDE achieves state-of-the-art performance on LLM pre-training attribution while being an order of magnitude (13×) faster than prior methods. We further validate its practical utility through downstream applications including data selection, data contamination detection, and qualitative analysis.
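The sparse-recovery idea can be illustrated with a toy example. The sketch below is not STRIDE itself: the steering operators are replaced by synthetic linear measurements, and the solver is a generic iterative soft-thresholding (ISTA) routine for the Lasso objective, chosen here for illustration. All names (`membership`, `shifts`, `true_influence`, `ista`) and all sizes are assumptions for the sake of the example. It assumes that each random subset of training examples shifts a test prediction by the sum of its members' influences, and that only a few examples have nonzero influence.

```python
import numpy as np

def ista(A, y, lam, n_iter=2000):
    """Minimise 0.5 * ||A w - y||^2 + lam * ||w||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = w - step * (A.T @ (A @ w - y))  # gradient step on the squared error
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return w

rng = np.random.default_rng(0)
n_train, n_subsets, n_influential = 200, 100, 5  # illustrative sizes only

# Hypothetical ground truth: a handful of training examples drive the prediction.
true_influence = np.zeros(n_train)
support = rng.choice(n_train, size=n_influential, replace=False)
true_influence[support] = rng.uniform(1.0, 2.0, size=n_influential)

# Row i marks which training examples belong to random subset i.
membership = (rng.random((n_subsets, n_train)) < 0.5).astype(float)

# Assumed measurement model: the shift in the test prediction caused by a
# subset is the sum of its members' influences, plus small noise.
shifts = membership @ true_influence + 0.01 * rng.standard_normal(n_subsets)

# Centre the columns and the measurements to remove the shared offset
# introduced by the 0/1 membership entries.
A = membership - membership.mean(axis=0)
y = shifts - shifts.mean()

estimate = ista(A, y, lam=0.5)
recovered = np.argsort(-np.abs(estimate))[:n_influential]
print("true support:     ", sorted(support.tolist()))
print("recovered support:", sorted(recovered.tolist()))
```

With fewer measurements (100 subsets) than unknowns (200 training examples), ordinary least squares would be underdetermined; the L1 penalty is what makes recovery possible under the sparsity assumption.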