Data Contamination Can Cross Language Barriers

Abstract

The opacity of large language model (LLM) development is raising growing concerns about the potential contamination of public benchmarks in pre-training data. Existing contamination detection methods are typically based on text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods; it is deliberately injected by overfitting LLMs on translated versions of benchmark test sets. We then propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine how an LLM's performance changes after the original benchmark is modified by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices may not even be wrong, since all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential use of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use are available at https://github.com/ShangDataLab/Deep-Contam.
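The benchmark modification described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the repository above for that). The names `MCQuestion`, `replace_false_choices`, `accuracy`, and `generalization_gap` are hypothetical, and the sketch assumes a multiple-choice benchmark in which the correct answer keeps its original position and `predict` is any callable that returns the index of the chosen option.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class MCQuestion:
    """One multiple-choice item (hypothetical container for this sketch)."""
    question: str
    choices: Tuple[str, ...]
    answer_index: int


def replace_false_choices(questions: List[MCQuestion], seed: int = 0) -> List[MCQuestion]:
    """Replace every false choice with the correct answer of another question.

    Assumption: the correct answer stays at its original position.
    """
    rng = random.Random(seed)
    correct_answers = [q.choices[q.answer_index] for q in questions]
    rewritten = []
    for i, q in enumerate(questions):
        own_answer = correct_answers[i]
        # Candidate replacements: correct answers of the other questions,
        # excluding any that coincide with this question's own answer.
        pool = sorted({a for j, a in enumerate(correct_answers)
                       if j != i and a != own_answer})
        n_false = len(q.choices) - 1
        if len(pool) < n_false:
            raise ValueError("not enough distinct answers from other questions")
        replacements = rng.sample(pool, n_false)
        new_choices = []
        for k, choice in enumerate(q.choices):
            if k == q.answer_index:
                new_choices.append(choice)  # keep the true answer in place
            else:
                new_choices.append(replacements.pop())
        rewritten.append(MCQuestion(q.question, tuple(new_choices), q.answer_index))
    return rewritten


def accuracy(predict: Callable[[MCQuestion], int], questions: List[MCQuestion]) -> float:
    """Fraction of questions for which predict returns the correct index."""
    return sum(predict(q) == q.answer_index for q in questions) / len(questions)


def generalization_gap(predict: Callable[[MCQuestion], int],
                       questions: List[MCQuestion], seed: int = 0) -> float:
    """Accuracy on the modified benchmark minus accuracy on the original."""
    modified = replace_false_choices(questions, seed)
    return accuracy(predict, modified) - accuracy(predict, questions)
```

Under the abstract's reasoning, the modified benchmark should be easier for a model that actually understands the questions, so a gap that is small or negative would hint at memorization. How large a gap counts as evidence is not stated in the abstract and is left open here.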
