Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

Abstract

While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training with their performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. Without fine-tuning, we find them to be limited. This suggests that much of the few-shot performance on novel datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We make the exposure tests we developed available as the tabmemcheck Python package at https://github.com/interpretml/LLM-Tabular-Memorization-Checker.
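To make the idea of a verbatim-memorization check concrete, here is a minimal sketch of one way such a test could work: show a model a few consecutive rows of a CSV file and count how often it reproduces the next row exactly. This is an illustration under stated assumptions, not the tabmemcheck API and not necessarily the procedure used in the paper. The function `row_completion_rate`, the two stand-in "models", and the toy data are all hypothetical; in practice `complete` would wrap a call to a real LLM.

```python
from typing import Callable, List


def row_completion_rate(rows: List[str], complete: Callable[[str], str],
                        prefix_len: int = 3) -> float:
    """Fraction of rows reproduced verbatim given the preceding rows.

    `complete` is a hypothetical text-completion function (prompt -> continuation).
    """
    hits, trials = 0, 0
    for i in range(prefix_len, len(rows)):
        # Prompt with the previous `prefix_len` rows, ending in a newline.
        prompt = "\n".join(rows[i - prefix_len:i]) + "\n"
        # Keep only the first line of the continuation as the predicted row.
        prediction = complete(prompt).split("\n")[0].strip()
        hits += prediction == rows[i]
        trials += 1
    return hits / trials if trials else 0.0


# Toy CSV standing in for a public dataset (values are made up).
rows = [
    "age,income,label",
    "39,50000,0",
    "52,64000,1",
    "28,31000,0",
    "45,72000,1",
    "33,45000,0",
]


def memorizing_model(prompt: str) -> str:
    """Stand-in for an LLM that has memorized the file verbatim."""
    text = "\n".join(rows) + "\n"
    start = text.find(prompt)
    return text[start + len(prompt):] if start >= 0 else ""


def naive_model(prompt: str) -> str:
    """Stand-in for a model with no memory of the file: repeats the last row."""
    return prompt.strip().split("\n")[-1]


memorized_rate = row_completion_rate(rows, memorizing_model)
naive_rate = row_completion_rate(rows, naive_model)
print(memorized_rate, naive_rate)
```

A high exact-match rate on rows that are hard to guess (for example, rows with many continuous values) would be evidence that the file was seen during training, whereas a rate near zero is what one would expect from a model relying only on general knowledge.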
