LLMs Can Leak Training Data, but Do They Want To? A Propensity-Aware Evaluation of LLM Memorization

Abstract

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether a model can be forced to do so, not whether it does so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing metric functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and reduced memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest that memorization audits should report both worst-case extractability and ordinary leakage propensity, and we encourage this practice as a way to obtain a more comprehensive view of the phenomenon.