
On Leakage of Code Generation Evaluation Datasets

July 10, 2024
Authors: Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen Gilsenan-McMahon, Matthias Gallé
cs.AI

Abstract
In this paper, we consider contamination by code generation test sets, in particular their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. Key to our findings is a new dataset of 161 prompts with their associated Python solutions, which is released at https://huggingface.co/datasets/CohereForAI/lbpp .
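Direct data leakage of the kind described in (i) is often screened for by measuring surface-level overlap between evaluation prompts and training documents. The sketch below illustrates one common approach, word-level n-gram overlap; it is a generic illustration under assumed parameters (8-gram window), not the authors' actual methodology.

```python
# Hypothetical sketch: flag potential direct leakage by measuring how many
# of an evaluation prompt's word-level n-grams also occur in a training
# document. This is a generic contamination check, not the paper's method.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(eval_prompt: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the prompt's n-grams that also appear in the training doc."""
    prompt_grams = ngrams(eval_prompt.lower(), n)
    if not prompt_grams:
        return 0.0
    doc_grams = ngrams(training_doc.lower(), n)
    return len(prompt_grams & doc_grams) / len(prompt_grams)

# Example: a training document that quotes an evaluation prompt verbatim.
prompt = "Write a function that returns the sum of all even numbers in a list"
train = "Example: write a function that returns the sum of all even numbers in a list of integers"
print(overlap_ratio(prompt, train, n=8))  # → 1.0 (full overlap, likely leaked)
```

Note that surface overlap only catches verbatim or near-verbatim leakage; the indirect leakage via synthetic data discussed in (ii) can evade such checks, since paraphrased or regenerated problems share semantics but not n-grams.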
