Data Contamination Can Cross Language Barriers

Abstract

The opacity of large language model (LLM) development is raising growing concerns about the potential contamination of public benchmarks in pre-training data. Existing contamination detection methods are typically based on text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods; we inject it deliberately by overfitting LLMs on translated versions of benchmark test sets. We then propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine how an LLM's performance changes after we modify the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, in which the false choices may not even be wrong, because all choices are correct in their memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential use of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from https://github.com/ShangDataLab/Deep-Contam.
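The benchmark modification described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); the `Question` record, the `generalize` and `accuracy` helpers, and the sampling strategy are all assumptions made for illustration. The sketch keeps each question's correct answer in place and replaces every false choice with the correct answer of some other question.

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical multiple-choice item; the field names are assumptions."""
    prompt: str
    choices: list[str]
    answer_idx: int  # index of the correct choice


def generalize(questions: list[Question], seed: int = 0) -> list[Question]:
    """Replace each question's false choices with correct answers
    borrowed from other questions, keeping the true answer in place."""
    rng = random.Random(seed)
    out = []
    for i, q in enumerate(questions):
        own = q.choices[q.answer_idx]
        # Correct answers of all other questions, minus any that
        # coincide with this question's own correct answer.
        pool = sorted(
            {o.choices[o.answer_idx] for j, o in enumerate(questions) if j != i}
            - {own}
        )
        n_false = len(q.choices) - 1
        if len(pool) < n_false:
            raise ValueError("not enough distinct answers to borrow")
        borrowed = iter(rng.sample(pool, n_false))
        new_choices = [
            own if k == q.answer_idx else next(borrowed)
            for k in range(len(q.choices))
        ]
        out.append(Question(q.prompt, new_choices, q.answer_idx))
    return out


def accuracy(predict, questions: list[Question]) -> float:
    """Fraction of questions where predict(q) returns the correct index."""
    return sum(predict(q) == q.answer_idx for q in questions) / len(questions)
```

Under these assumptions, one would compare `accuracy(predict, original)` with `accuracy(predict, generalize(original))` for a given model wrapper `predict`. A model that actually understands the questions should find the modified set easier, whereas the abstract suggests a contaminated model gains little or nothing. The abstract does not state a decision threshold, so none is assumed here.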
