On Leakage of Code Generation Evaluation Datasets
July 10, 2024
Authors: Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé
cs.AI
Abstract
In this paper we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection.
Key to our findings is a new dataset of 161 prompts with their associated Python solutions, which is released at https://huggingface.co/datasets/CohereForAI/lbpp.
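To make the first source, direct data leakage, concrete, the following is a minimal sketch of one common way such leakage could be probed: measuring how many token n-grams of a benchmark solution also appear verbatim in a training document. This is an illustration under stated assumptions, not the method used by the authors; the function names (`tokenize`, `ngrams`, `overlap_ratio`), the crude regex tokenizer, and the default n-gram size are all hypothetical choices.

```python
import re


def tokenize(code: str) -> list[str]:
    # Crude tokenizer (an assumption): identifiers, integers, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    # All contiguous token windows of length n, as a set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_solution: str, corpus_document: str, n: int = 8) -> float:
    # Fraction of the benchmark solution's n-grams found in the corpus document.
    bench = ngrams(tokenize(benchmark_solution), n)
    if not bench:
        return 0.0  # solution shorter than n tokens: nothing to compare
    doc = ngrams(tokenize(corpus_document), n)
    return len(bench & doc) / len(bench)


if __name__ == "__main__":
    solution = "def add(a, b):\n    return a + b\n"
    leaked_doc = "# utils\n" + solution
    clean_doc = "print('hello world')"
    print(overlap_ratio(solution, leaked_doc, n=4))
    print(overlap_ratio(solution, clean_doc, n=4))
```

A ratio close to 1 suggests the solution appears nearly verbatim in the document, while a ratio close to 0 suggests no surface overlap. Exact n-gram matching only catches verbatim copies; it would not detect the indirect leakage through synthetic data or the overfitting during model selection that the abstract also names.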