LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
June 4, 2026
Authors: Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech
cs.AI
Abstract
Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization capability and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Based on these results, we recommend that memorization audits report both worst-case extractability and ordinary leakage propensity, to give a more comprehensive view of this phenomenon.
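The capability/propensity contrast described above can be illustrated with a toy sketch. This is not PropMe's metric transformation or SimpleTrace's pipeline, neither of which the abstract specifies. It assumes whitespace tokenization, a Python set of n-grams standing in for an infini-gram index, and hard-coded strings standing in for model generations. All function and variable names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Iterable[NGram]:
    """Yield every contiguous n-gram of a token list."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_index(corpus: List[str], n: int) -> Set[NGram]:
    """Toy stand-in for a corpus index: the set of all corpus n-grams."""
    index: Set[NGram] = set()
    for doc in corpus:
        index.update(ngrams(doc.split(), n))
    return index


def verbatim_fraction(generation: str, index: Set[NGram], n: int) -> float:
    """Fraction of generated tokens covered by at least one corpus n-gram."""
    tokens = generation.split()
    if not tokens:
        return 0.0
    covered = [False] * len(tokens)
    for i, gram in enumerate(ngrams(tokens, n)):
        if gram in index:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)


def mean_score(generations: List[str], index: Set[NGram], n: int) -> float:
    """Average verbatim fraction over a set of generations."""
    scores = [verbatim_fraction(g, index, n) for g in generations]
    return sum(scores) / len(scores) if scores else 0.0


# Assumed toy data: one "training" document and canned model outputs.
corpus = ["the quick brown fox jumps over the lazy dog"]
index = build_index(corpus, n=3)

# Output when prompted with a training-data prefix (capability setting).
prefix_generations = ["jumps over the lazy dog"]
# Outputs for ordinary prompts (non-adversarial setting).
generic_generations = ["he said the lazy dog slept", "dogs are lazy"]

capability_score = mean_score(prefix_generations, index, n=3)
propensity_score = mean_score(generic_generations, index, n=3)
gap = capability_score - propensity_score
print(capability_score, propensity_score, gap)
```

Reporting both numbers, rather than only the prefix-attack score, mirrors the recommendation that audits cover worst-case extractability and ordinary leakage together.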