Recovering from Privacy-Preserving Masking with Large Language Models

Abstract

Model adaptation is crucial for handling the discrepancy between proxy training data and the actual user data received. To perform adaptation effectively, users' textual data is typically stored on servers or on their local devices, where downstream natural language processing (NLP) models can be trained directly on such in-domain data. However, this can raise privacy and security concerns, because it increases the risk of exposing user information to adversaries. Replacing identifying information in textual data with a generic marker has recently been explored as a remedy. In this work, we leverage large language models (LLMs) to suggest substitutes for masked tokens and evaluate their effectiveness on downstream language modeling tasks. Specifically, we propose multiple pre-trained and fine-tuned LLM-based approaches and conduct empirical studies on various datasets to compare these methods. Experimental results show that models trained on the obfuscated corpora achieve performance comparable to that of models trained on the original data without privacy-preserving token masking.
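The mask-then-recover idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the marker string `[MASK]`, the function names, and the rule-based `toy_suggest` (a stand-in for a pre-trained or fine-tuned LLM) are all assumptions made for illustration only.

```python
from typing import Callable, Iterable, List

# Hypothetical generic marker; the abstract does not name a specific one.
MASK = "[MASK]"


def mask_tokens(tokens: List[str], sensitive: Iterable[str]) -> List[str]:
    """Replace every token listed as identifying with the generic marker."""
    sensitive_set = {s.lower() for s in sensitive}
    return [MASK if t.lower() in sensitive_set else t for t in tokens]


def recover(masked: List[str],
            suggest: Callable[[List[str], int], str]) -> List[str]:
    """Fill each marker with a substitute proposed by `suggest`.

    `suggest` stands in for an LLM: it receives the partially recovered
    sequence and the position of a marker, and returns one replacement token.
    """
    out = list(masked)
    for i, tok in enumerate(out):
        if tok == MASK:
            out[i] = suggest(out, i)
    return out


def toy_suggest(tokens: List[str], i: int) -> str:
    """Assumed placeholder for an LLM call: picks a substitute from context."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    return "Paris" if prev in {"in", "to", "from"} else "Alex"


tokens = "Maria flew to Lisbon yesterday".split()
masked = mask_tokens(tokens, ["Maria", "Lisbon"])
recovered = recover(masked, toy_suggest)
print(masked)
print(recovered)
```

In this sketch the recovered sequence contains plausible substitutes rather than the original identifying tokens, so a downstream language model could be trained on it in place of the raw user text.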
