Data Contamination Can Cross Language Barriers

Abstract

The opacity of large language model (LLM) development is raising growing concerns about the potential contamination of public benchmarks in pre-training data. Existing contamination detection methods are typically based on text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods; it is deliberately injected by overfitting LLMs on translated versions of benchmark test sets. We then propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine the change in an LLM's performance after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, in which the false choices may be "not even wrong", because every choice is a correct answer in the model's memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential use of cross-lingual contamination for interpreting LLMs' working mechanisms and for post-training LLMs to enhance their multilingual capabilities. The code and dataset we use are available at https://github.com/ShangDataLab/Deep-Contam.
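The benchmark modification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that); the item layout (`question`, `choices`, `answer` as an index) and the function names `build_generalized_benchmark` and `accuracy_change` are assumptions made for illustration only.

```python
import random
from typing import Dict, List


def build_generalized_benchmark(items: List[Dict], seed: int = 0) -> List[Dict]:
    """Replace every false choice with a correct answer taken from another question.

    Assumed (hypothetical) item layout:
        {"question": str, "choices": List[str], "answer": int}
    where "answer" is the index of the correct choice.
    """
    rng = random.Random(seed)
    correct_texts = [item["choices"][item["answer"]] for item in items]

    generalized = []
    for i, item in enumerate(items):
        own_answer = correct_texts[i]
        # Candidate replacements: correct answers of *other* questions,
        # deduplicated and never identical to this question's own answer.
        pool = sorted(
            {t for j, t in enumerate(correct_texts) if j != i and t != own_answer}
        )
        n_false = len(item["choices"]) - 1
        if len(pool) < n_false:
            raise ValueError("not enough other questions to draw replacements from")

        replacements = iter(rng.sample(pool, n_false))
        # Keep the correct choice at its original position; swap out the rest.
        new_choices = [
            choice if k == item["answer"] else next(replacements)
            for k, choice in enumerate(item["choices"])
        ]
        generalized.append({**item, "choices": new_choices})
    return generalized


def accuracy_change(acc_original: float, acc_generalized: float) -> float:
    """Performance change after the modification (generalized minus original)."""
    return acc_generalized - acc_original
```

Under the abstract's reasoning, a model that has not memorized the benchmark should find the modified version easier, whereas a contaminated model is expected to generalize poorly to it. How large a change should count as evidence of contamination is not stated in the abstract and is left open here.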
