

Recovering from Privacy-Preserving Masking with Large Language Models

September 12, 2023
Authors: Arpita Vats, Zhe Liu, Peng Su, Debjyoti Paul, Yingyi Ma, Yutong Pang, Zeeshan Ahmed, Ozlem Kalinli
cs.AI

Abstract

Model adaptation is crucial for handling the discrepancy between proxy training data and the actual user data received. To perform adaptation effectively, users' textual data is typically stored on servers or on their local devices, where downstream natural language processing (NLP) models can be trained directly on such in-domain data. However, this may raise privacy and security concerns due to the extra risk of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has recently been explored. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and perform empirical studies on various datasets to compare these methods. Experimental results show that models trained on the obfuscated corpora achieve performance comparable to models trained on the original data without privacy-preserving token masking.
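To make the pre-trained-LLM idea concrete, the sketch below uses a masked language model to propose in-context substitutes for generic privacy markers. This is a minimal sketch under stated assumptions, not the paper's exact method: the model choice (bert-base-uncased), the MARKER string, and the left-to-right, one-marker-at-a-time resolution order are all illustrative.

```python
# Minimal sketch: recover a privacy-masked sentence by asking a pre-trained
# masked language model (MLM) to propose substitutes for each generic marker.
# Assumptions (not from the paper): the model, the MARKER string, and the
# left-to-right resolution order are illustrative choices.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MARKER = "<mask>"  # hypothetical generic marker produced by the masking step


def recover(sentence: str) -> str:
    """Replace each marker with the MLM's top suggestion, left to right."""
    mask_token = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
    while MARKER in sentence:
        # Expose one marker at a time so each suggestion conditions on the
        # substitutes already filled in; any remaining literal markers are
        # just tokenizer noise in this simplified sketch.
        masked = sentence.replace(MARKER, mask_token, 1)
        best = fill_mask(masked, top_k=1)[0]  # highest-probability candidate
        sentence = masked.replace(mask_token, best["token_str"].strip(), 1)
    return sentence


print(recover("my name is <mask> and i live in <mask> ."))
# e.g. -> "my name is john and i live in california ."
```

In the paper's setting, the sentences recovered this way form the obfuscated corpus on which the downstream language model is then trained; the fine-tuned variants the abstract mentions would adapt the substitute-suggesting LLM itself rather than use it off the shelf.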