Recovering from Privacy-Preserving Masking with Large Language Models

Abstract

Model adaptation is crucial for handling the discrepancy between proxy training data and the actual user data received. To perform adaptation effectively, users' textual data is typically stored on servers or on their local devices, where downstream natural language processing (NLP) models can be trained directly on such in-domain data. However, this can raise privacy and security concerns, because it adds the risk of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has recently been explored. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets to compare these methods. Experimental results show that models trained on the obfuscated corpora achieve performance comparable to that of models trained on the original data without privacy-preserving token masking.
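As a rough illustration of the mask-then-substitute idea described above, the following minimal sketch replaces sensitive tokens with a generic marker and then fills each marker with a suggested substitute. All names here (`MASK`, `mask_tokens`, `fill_masks`, `toy_suggest`) are hypothetical, and the `[MASK]` marker string is an assumption. `toy_suggest` is a hard-coded stand-in for an LLM call and does not represent the approaches proposed in the paper.

```python
import re
from typing import Callable, Iterable

MASK = "[MASK]"  # assumed generic marker; the source does not name one


def mask_tokens(text: str, sensitive: Iterable[str], marker: str = MASK) -> str:
    """Replace each whole-word occurrence of a sensitive token with the marker."""
    for token in sensitive:
        pattern = rf"\b{re.escape(token)}\b"
        text = re.sub(pattern, lambda _match: marker, text)
    return text


def fill_masks(
    text: str,
    suggest: Callable[[str, str], str],
    marker: str = MASK,
) -> str:
    """Fill markers left to right.

    `suggest(left, right)` receives the already-filled left context and the
    remaining right context (which may still contain markers) and returns
    a substitute token.
    """
    parts = text.split(marker)
    filled = parts[0]
    for i, following in enumerate(parts[1:], start=1):
        right_context = marker.join(parts[i:])
        filled += suggest(filled, right_context) + following
    return filled


def toy_suggest(left: str, right: str) -> str:
    """Hypothetical stand-in for an LLM: picks a fixed substitute from the left context."""
    return "Paris" if left.rstrip().endswith(" to") else "Alex"


original = "Send flowers to Berlin for Maria"
masked = mask_tokens(original, ["Berlin", "Maria"])
obfuscated = fill_masks(masked, toy_suggest)
```

In a real pipeline, `toy_suggest` would be replaced by a call to a pre-trained or fine-tuned language model, and the resulting obfuscated text would serve as training data for the downstream model.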
