Data Contamination Can Cross Language Barriers
June 19, 2024
Authors: Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang
cs.AI
Abstract
The opacity in developing large language models (LLMs) is raising growing
concerns about the potential contamination of public benchmarks in the
pre-training data. Existing contamination detection methods are typically based
on the text overlap between training and evaluation data, which can be too
superficial to reflect deeper forms of contamination. In this paper, we first
present a cross-lingual form of contamination that inflates LLMs' performance
while evading current detection methods; we inject it deliberately by overfitting
LLMs on translated versions of benchmark test sets. Then, we propose
generalization-based approaches to unmask such deeply concealed contamination.
Specifically, we examine the LLM's performance change after modifying the
original benchmark by replacing the false answer choices with correct ones from
other questions. Contaminated models can hardly generalize to such easier
situations, where the false choices may not even be wrong, since the model
has memorized every choice as a correct answer. Experimental results demonstrate
that cross-lingual contamination can easily fool existing detection methods,
but not ours. In addition, we discuss the potential utilization of
cross-lingual contamination in interpreting LLMs' working mechanisms and in
post-training LLMs for enhanced multilingual capabilities. The code and dataset
we use can be obtained from https://github.com/ShangDataLab/Deep-Contam.
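
As a rough illustration of the generalization-based check described in the abstract, the sketch below rebuilds a multiple-choice benchmark so that every false choice is replaced by the correct answer of some other question, then compares accuracy before and after. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`build_generalized_benchmark`, `accuracy`, `generalization_gap`) and the dict layout of each item are hypothetical, and the way the paper samples replacements or scores the change may differ from what is shown here.

```python
import random


def build_generalized_benchmark(items, seed=0):
    """Replace each false choice with the correct answer of another question.

    Assumed (hypothetical) item layout:
        {"question": str, "choices": list[str], "answer": int}
    where "answer" is the index of the correct choice.
    """
    rng = random.Random(seed)
    correct_texts = [it["choices"][it["answer"]] for it in items]
    rebuilt = []
    for i, it in enumerate(items):
        own = correct_texts[i]
        # Candidate replacements: correct answers of *other* questions,
        # skipping any text identical to this question's own answer.
        pool = sorted({t for j, t in enumerate(correct_texts) if j != i and t != own})
        n_false = len(it["choices"]) - 1
        if len(pool) < n_false:
            raise ValueError("not enough distinct answers from other questions")
        replacements = rng.sample(pool, n_false)
        new_choices = [
            own if k == it["answer"] else replacements.pop()
            for k in range(len(it["choices"]))
        ]
        rebuilt.append(
            {"question": it["question"], "choices": new_choices, "answer": it["answer"]}
        )
    return rebuilt


def accuracy(predictions, items):
    """Fraction of predicted choice indices that match the gold index."""
    return sum(p == it["answer"] for p, it in zip(predictions, items)) / len(items)


def generalization_gap(acc_original, acc_generalized):
    """Accuracy change when moving from the original to the rebuilt benchmark."""
    return acc_generalized - acc_original
```

Under the abstract's reasoning, a model that actually understands the questions should find the rebuilt benchmark easier and show a clearly positive gap, whereas a model that memorized the test set tends to gain little or even lose accuracy, because every offered choice looks like a memorized correct answer. Any threshold for calling a model contaminated would be a further assumption and is left out here.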