LLMはトレーニングデータを漏洩し得るが、それは意図的なのか？ LLMにおける記憶の傾向を考慮した評価

要旨

大規模言語モデルは訓練データを再現することができるが、既存の記憶化評価は主にモデルが強制的にそうさせられるかどうかを測定するものであり、通常の使用においてそうするかどうかを測定するものではない。我々は、プリフィックスに基づく能力攻撃と非敵対的評価を対比する、記憶化評価のための傾向認識型フレームワークPropMeを紹介する。我々は、既存の関数に適用することで傾向指標を作成できるメトリック変換を提案する。さらに、infini-gram上に構築された軽量トレーシングパイプラインSimpleTraceを導入する。これは、モデルの生成を大規模訓練コーパスに決定論的に帰属させ、逐語的、ほぼ逐語的、および傾向変換された記憶化指標を計算する。二つの完全公開モデル（CommaとDFM Decoder）を、二つのデータセット（Common PileとDynaword）で二言語において評価したところ、能力と傾向の間に一貫したギャップがあることがわかった。プリフィックス攻撃は、汎用またはデータセット固有のプロンプトよりもかなり強い記憶化シグナルを引き出す一方、傾向スコアは全体的に低いままである。したがって、モデルは直接的に引き出された場合には訓練データを明らかにすることができるが、より一般的な非敵対的設定ではめったにそうならない。また、Commaから継続的に事前学習されたDFM Decoderは、Common Pileに対する記憶化および記憶化傾向が低減していることがわかり、後の訓練で部分的に異なるデータに重点が置かれると記憶化能力が低下し得ることが確認された。我々の結果は、記憶化監査において、この現象のより包括的な理解を得るために、最悪ケースの抽出可能性と通常の漏洩傾向の両方を報告すべきであることを示唆しており、我々はそうすることを推奨する。

English

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, allows to create propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully-open models: Comma and DFM Decoder on two datasets: Common Pile and Dynaword in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest, and we encourage, that memorization audits should report both worst-case extractability and ordinary leakage propensity in order to have a more comprehensive view of this phenomenon.