Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

Abstract

While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to their performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. Without fine-tuning, we find them to be limited. This suggests that much of the few-shot performance on novel datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We make the exposure tests we developed available as the tabmemcheck Python package at https://github.com/interpretml/LLM-Tabular-Memorization-Checker
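To make the idea of an exposure test concrete, the following is a minimal sketch of one plausible form such a test could take: show the model a few consecutive rows of a CSV file and check whether it reproduces the next row verbatim. This is an illustration under stated assumptions, not the tabmemcheck API. The function name `row_completion_rate`, its parameters, and the `complete` callable (a stand-in for whatever LLM completion call is available) are all hypothetical.

```python
from typing import Callable


def row_completion_rate(
    csv_text: str,
    complete: Callable[[str], str],  # hypothetical: maps a prompt to the model's continuation
    num_prefix_rows: int = 3,
    num_queries: int = 5,
) -> float:
    """Return the fraction of queried rows that the model reproduces verbatim.

    The first non-empty line of csv_text is treated as the header.
    """
    lines = [line for line in csv_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    header, rows = lines[0], lines[1:]

    hits = 0
    total = 0
    # Slide a window over consecutive rows: the window is the prompt,
    # and the row right after the window is the target.
    for start in range(0, len(rows) - num_prefix_rows):
        if total >= num_queries:
            break
        prefix = rows[start:start + num_prefix_rows]
        target = rows[start + num_prefix_rows].strip()

        prompt = "\n".join([header] + prefix) + "\n"
        continuation = complete(prompt).strip().splitlines()
        predicted = continuation[0].strip() if continuation else ""

        # Only an exact string match counts as a hit.
        hits += int(predicted == target)
        total += 1

    return hits / total if total else 0.0
```

A rate near zero is what one would expect for a dataset the model has never seen, since guessing every cell of a row exactly is unlikely for data with many distinct values. A high rate is evidence of verbatim memorization. Exact matching is a deliberately strict choice in this sketch, and a practical test would also need to account for datasets whose rows are easy to guess, such as files with few columns or repeated values.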
