Data Contamination Can Cross Language Barriers

Abstract

The opacity surrounding the development of large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks in pre-training data. Existing contamination detection methods are typically based on text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods; it is deliberately injected by overfitting LLMs on translated versions of benchmark test sets. We then propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine how an LLM's performance changes after the original benchmark is modified by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices may be "not even wrong," because every choice is correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential use of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use are available at https://github.com/ShangDataLab/Deep-Contam.
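The answer-choice replacement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The `MCQuestion` type, the function names, and the choices to keep the correct answer in its original position and to sample distractors uniformly are all assumptions made here for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class MCQuestion:
    """Hypothetical multiple-choice item: a question, its choices, and the index of the correct one."""
    question: str
    choices: List[str]
    answer_idx: int


def build_generalized_benchmark(benchmark: List[MCQuestion], seed: int = 0) -> List[MCQuestion]:
    """Replace every false choice with a correct answer taken from another question.

    Assumption: the correct answer keeps its original position, and distractors
    are sampled uniformly without replacement.
    """
    rng = random.Random(seed)
    generalized = []
    for i, item in enumerate(benchmark):
        correct = item.choices[item.answer_idx]
        # Correct answers of all *other* questions, excluding any that coincide
        # with this question's own correct answer.
        pool = sorted(
            {other.choices[other.answer_idx] for j, other in enumerate(benchmark) if j != i}
            - {correct}
        )
        n_false = len(item.choices) - 1
        if len(pool) < n_false:
            raise ValueError("not enough distinct correct answers to use as distractors")
        new_choices = rng.sample(pool, n_false)
        new_choices.insert(item.answer_idx, correct)
        generalized.append(MCQuestion(item.question, new_choices, item.answer_idx))
    return generalized


def accuracy_gain(acc_original: float, acc_generalized: float) -> float:
    """Change in accuracy when moving from the original to the modified benchmark."""
    return acc_generalized - acc_original
```

Under this sketch, a model would be scored on both the original and the modified benchmark and the two accuracies compared with `accuracy_gain`. Following the abstract's reasoning, the modified benchmark should be easier, so a small or negative gain would be a signal worth investigating; any decision threshold is left unspecified here.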
