TOFU: A Task of Fictitious Unlearning for LLMs

Abstract

Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at deepening our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles, called the forget set, that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches that tune models so that they truly behave as if they were never trained on the forget data at all.
