Recovering from Privacy-Preserving Masking with Large Language Models
September 12, 2023
Authors: Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli
cs.AI
Abstract
Model adaptation is crucial for handling the discrepancy between proxy training
data and the actual user data received. To perform adaptation effectively, users'
textual data is typically stored on servers or on their local devices, where
downstream natural language processing (NLP) models can be trained directly
on such in-domain data. However, this may raise privacy and security
concerns due to the added risk of exposing user information to adversaries.
Replacing identifying information in textual data with a generic marker has
recently been explored. In this work, we leverage large language models (LLMs)
to suggest substitutes for masked tokens and evaluate their effectiveness
on downstream language modeling tasks. Specifically, we propose multiple
pre-trained and fine-tuned LLM-based approaches and perform empirical studies
on various datasets to compare these methods. Experimental results
show that models trained on the obfuscated corpora achieve performance
comparable to models trained on the original data without
privacy-preserving token masking.
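
To make the token-substitution idea concrete, below is a minimal sketch (not the authors' exact method) of how a pre-trained masked language model could propose replacements for a privacy-masked token. The model name "bert-base-uncased", the [MASK] marker, and the example sentence are illustrative assumptions; the Hugging Face transformers fill-mask pipeline stands in for whichever pre-trained LLM is used.

```python
# Minimal sketch: ask a pre-trained masked LM for candidate substitutes
# for a token that was replaced by a generic privacy marker.
# Assumption: the generic marker has been mapped to the model's [MASK] token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# A sentence whose identifying token has been masked out.
masked_sentence = "please call [MASK] about the delivery tomorrow"

# The pipeline returns candidate substitutes ranked by model probability.
for candidate in fill_mask(masked_sentence, top_k=5):
    print(f"{candidate['token_str']:>12}  p={candidate['score']:.3f}")
```

In the setting described by the abstract, such suggested substitutes would replace the generic markers to form an obfuscated corpus, on which the downstream language models are then trained.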