Recovering from Privacy-Preserving Masking with Large Language Models
September 12, 2023
Authors: Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli
cs.AI
Abstract
Model adaptation is crucial for handling the discrepancy between proxy training
data and the actual user data received. To perform adaptation effectively, users'
textual data is typically stored on servers or on their local devices, where
downstream natural language processing (NLP) models can be trained directly
on such in-domain data. However, this may raise privacy and security
concerns due to the added risk of exposing user information to adversaries.
Replacing identifying information in textual data with a generic marker has
recently been explored. In this work, we leverage large language models (LLMs)
to suggest substitutes for masked tokens and evaluate their effectiveness
on downstream language modeling tasks. Specifically, we propose multiple
pre-trained and fine-tuned LLM-based approaches and perform empirical studies
on various datasets to compare these methods. Experimental results
show that models trained on the obfuscated corpora achieve performance
comparable to models trained on the original data without
privacy-preserving token masking.
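
To make the token-substitution idea concrete, below is a minimal sketch (not the authors' exact method) of how a pre-trained masked language model could propose replacements for a privacy-masked token. The model name "bert-base-uncased", the [MASK] marker, and the example sentence are illustrative assumptions; the Hugging Face transformers fill-mask pipeline stands in for whichever pre-trained LLM is used.

```python
# Minimal sketch: ask a pre-trained masked LM for candidate substitutes
# for a token that was replaced by a generic privacy marker.
# Assumption: the generic marker has been mapped to the model's [MASK] token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# A sentence whose identifying token has been masked out.
masked_sentence = "please call [MASK] about the delivery tomorrow"

# The pipeline returns candidate substitutes ranked by model probability.
for candidate in fill_mask(masked_sentence, top_k=5):
    print(f"{candidate['token_str']:>12}  p={candidate['score']:.3f}")
```

In the setting described by the abstract, such suggested substitutes would replace the generic markers to form an obfuscated corpus, on which the downstream language models are then trained.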