Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
April 9, 2024
Authors: Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana
cs.AI
Abstract
While many have shown how Large Language Models (LLMs) can be applied to a
diverse set of tasks, the critical issues of data contamination and
memorization are often glossed over. In this work, we address this concern for
tabular data. Specifically, we introduce a variety of different techniques to
assess whether a language model has seen a tabular dataset during training.
This investigation reveals that LLMs have memorized many popular tabular
datasets verbatim. We then compare the few-shot learning performance of LLMs on
datasets that were seen during training to the performance on datasets released
after training. We find that LLMs perform better on datasets seen during
training, indicating that memorization leads to overfitting. At the same time,
LLMs show non-trivial performance on novel datasets and are surprisingly robust
to data transformations. We then investigate the in-context statistical
learning abilities of LLMs. Without fine-tuning, we find them to be limited.
This suggests that much of the few-shot performance on novel datasets is due to
the LLM's world knowledge. Overall, our results highlight the importance of
testing whether an LLM has seen an evaluation dataset during pre-training. We
make the exposure tests we developed available as the tabmemcheck Python
package at https://github.com/interpretml/LLM-Tabular-Memorization-Checker.
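The exposure tests described above probe whether a model can reproduce parts of a dataset verbatim. Below is a minimal conceptual sketch of one such check, a header/row-completion test, under stated assumptions: it is not the actual tabmemcheck API, query_llm is a hypothetical placeholder for whatever LLM client you use, and the file name in the usage comment is illustrative.

# Conceptual sketch of a "row completion" exposure test in the spirit of the
# paper's memorization checks. NOT the tabmemcheck API; query_llm is a
# hypothetical stand-in for your own LLM completion call.

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM completion call."""
    raise NotImplementedError("plug in your own model client here")


def header_exposure_test(csv_path: str, num_prefix_rows: int = 2,
                         num_completion_rows: int = 3) -> bool:
    """Return True if the model reproduces rows it was not shown verbatim,
    which would suggest the dataset appeared in its training data."""
    with open(csv_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]

    # Prompt contains the header plus the first few rows; the target rows are withheld.
    prefix = "\n".join(lines[: 1 + num_prefix_rows])
    target = "\n".join(lines[1 + num_prefix_rows: 1 + num_prefix_rows + num_completion_rows])

    completion = query_llm(
        "Complete the next rows of this CSV file exactly:\n" + prefix + "\n"
    )
    # Verbatim reproduction of withheld rows is strong evidence of memorization;
    # partial matches would call for softer similarity metrics.
    return completion.strip().startswith(target)


# Example usage (hypothetical file name):
# memorized = header_exposure_test("adult.csv")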