DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Abstract

Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility also makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any other internal details of the model. Just as banks mix dye packs with their cash to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design that incorporates multiple backdoors with stochastic targets, which allows the false positive rate (FPR) to be computed exactly whenever a model is flagged. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed FPR of just 0.127% using six backdoors.
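
To make the idea of an exactly computable FPR concrete, here is a minimal sketch under assumptions that the abstract does not state. It assumes that each backdoor's target is drawn uniformly from `num_targets` options, that an uncontaminated model matches each target independently with probability `1 / num_targets`, and that a model is flagged when at least `threshold` of the `num_backdoors` backdoors match. The function name and all three parameters are hypothetical, chosen only for illustration.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Binomial tail P(X >= threshold), X ~ Binomial(num_backdoors, 1/num_targets).

    Assumption (not from the abstract): a clean model hits each randomly
    chosen backdoor target independently with probability 1/num_targets.
    """
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )


if __name__ == "__main__":
    # Illustrative values only: 8 backdoors, 10 possible targets, flag at >= 7 matches.
    fpr = false_positive_rate(num_backdoors=8, num_targets=10, threshold=7)
    print(f"FPR = {fpr:.3e} ({fpr * 100:.6f}%)")
```

With these illustrative values the tail probability is about 7.3e-7, the same order as the 0.000073% figure quoted for MMLU-Pro. The specific target count and threshold are guesses rather than settings taken from the text.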

