Large Language Models Can Leak Training Data, but Do They Intend To? A Propensity-Aware Study of Memorization Evaluation

Abstract

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, when applied to existing memorization functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity on Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Based on these results, we encourage memorization audits to report both worst-case extractability and ordinary leakage propensity, which together give a more comprehensive view of the phenomenon.