

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

June 14, 2024
作者: Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein
cs.AI

Abstract

Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subset of tokens is excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.
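The core idea from the abstract can be sketched in a few lines: compute the usual per-token next-token cross-entropy, then zero out a random subset of token positions before averaging, so those tokens contribute no gradient and cannot be memorized. The sketch below is a minimal numpy illustration under assumed shapes; the function name `goldfish_loss`, the `drop_prob` parameter, and the uniform random mask are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def goldfish_loss(logits, targets, drop_prob=0.25, rng=None):
    """Next-token cross-entropy that ignores a random subset of tokens.

    logits:  (seq_len, vocab_size) array of unnormalized scores.
    targets: (seq_len,) array of target token ids.
    drop_prob: fraction of positions excluded from the loss (assumed
               hyperparameter for this sketch).
    """
    rng = rng or np.random.default_rng(0)
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Standard per-token negative log-likelihood.
    per_token = -log_probs[np.arange(len(targets)), targets]
    # Randomly keep a subset of positions; dropped tokens get no gradient,
    # so the model cannot memorize them verbatim.
    keep = rng.random(len(targets)) >= drop_prob
    # Average the loss over kept positions only.
    return (per_token * keep).sum() / max(keep.sum(), 1)
```

With `drop_prob=0` this reduces to the ordinary mean cross-entropy, which makes the modification easy to A/B against standard training.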

