
Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

April 9, 2024
Authors: Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana
cs.AI

Abstract

While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. Without fine-tuning, we find them to be limited. This suggests that much of the few-shot performance on novel datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We make the exposure tests we developed available as the tabmemcheck Python package at https://github.com/interpretml/LLM-Tabular-Memorization-Checker.
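As a minimal sketch of how the exposure tests might be invoked, the snippet below runs the package against a CSV file. The function names (`run_all_tests`, `header_test`), the string-based model argument, and the example file name are assumptions based on the package's repository and should be verified against its README.

```python
# Minimal sketch of running tabmemcheck exposure tests on a tabular dataset.
# NOTE: run_all_tests / header_test and the "gpt-3.5-turbo" model string are
# assumptions to check against the package README; "adult-train.csv" is a
# hypothetical example file. An OpenAI API key is assumed to be configured.
import tabmemcheck

# Run the full battery of memorization/exposure tests on the dataset,
# querying the named LLM.
tabmemcheck.run_all_tests("adult-train.csv", "gpt-3.5-turbo")

# Individual tests can also be run, e.g. checking whether the model can
# reproduce the dataset's header row verbatim.
tabmemcheck.header_test("adult-train.csv", "gpt-3.5-turbo")
```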
