Controlled Generation for Private Synthetic Text

Abstract

Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that leverages the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes to guide controllable generation using either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.