Controlled Generation for Private Synthetic Text
September 30, 2025
Authors: Zihao Zhao, Anjalie Field
cs.AI
Abstract
Text anonymization is essential for responsibly developing and deploying AI
in high-stakes domains such as healthcare, social services, and law. In this
work, we propose a novel methodology for privacy-preserving synthetic text
generation that leverages the principles of de-identification and the Hiding In
Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes
to guide controllable generation using either in-context learning (ICL) or
prefix tuning. The ICL variant ensures privacy levels consistent with the
underlying de-identification system, while the prefix tuning variant
incorporates a custom masking strategy and loss function to support scalable,
high-quality generation. Experiments on legal and clinical datasets demonstrate
that our method achieves a strong balance between privacy protection and
utility, offering a practical and effective solution for synthetic text
generation in sensitive domains.
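
The abstract does not spell out the control-code format, the surrogate selection rule, or the prompt layout, so the following is only a minimal sketch of how HIPS-style replacement and entity-aware control codes might feed an ICL prompt. Every name here (`Entity`, `SURROGATES`, `hips_replace`, `control_code`, `build_icl_prompt`), the `<LABEL=value>` code syntax, and the `Control:` / `Text:` prompt layout are illustrative assumptions, not the authors' implementation. Entity spans are assumed to come from an upstream de-identification system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A detected identifier span (hypothetical structure)."""
    start: int
    end: int
    label: str


# Hypothetical surrogate pool; a real system would sample from a much larger set.
SURROGATES = {
    "PERSON": ["Jordan Avery", "Casey Morgan"],
    "DATE": ["March 3, 2011", "July 19, 2014"],
}


def hips_replace(text, entities, surrogates=SURROGATES):
    """Replace each detected span with a realistic surrogate of the same type.

    This follows the Hiding In Plain Sight idea: any identifier the detector
    misses is harder to spot because it sits among surrogates.
    Simplification: repeated mentions of one entity are not kept consistent.
    """
    pieces, used, counts, cursor = [], [], {}, 0
    for ent in sorted(entities, key=lambda e: e.start):
        pool = surrogates[ent.label]
        idx = counts.get(ent.label, 0)
        surrogate = pool[idx % len(pool)]  # cycle through the pool per label
        counts[ent.label] = idx + 1
        pieces.append(text[cursor:ent.start])
        pieces.append(surrogate)
        used.append((ent.label, surrogate))
        cursor = ent.end
    pieces.append(text[cursor:])
    return "".join(pieces), used


def control_code(used):
    """Serialize (label, surrogate) pairs into an assumed control-code string."""
    return " ".join(f"<{label}={value}>" for label, value in used)


def build_icl_prompt(examples, target_code):
    """Assemble an in-context prompt from (control code, text) demonstrations."""
    blocks = [f"Control: {code}\nText: {text}" for code, text in examples]
    blocks.append(f"Control: {target_code}\nText:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    note = "John Smith was admitted on May 1, 2020."
    date_start = note.index("May 1, 2020")
    spans = [
        Entity(0, len("John Smith"), "PERSON"),
        Entity(date_start, date_start + len("May 1, 2020"), "DATE"),
    ]
    deidentified, used = hips_replace(note, spans)
    code = control_code(used)
    # The de-identified note becomes a demonstration; the same code is reused
    # here as the target only to keep the example short.
    print(build_icl_prompt([(code, deidentified)], code))
```

In this sketch the language model only ever sees surrogate values, which is one way to read the claim that the ICL variant's privacy level tracks the underlying de-identification system: the prompt leaks an identifier only if the detector missed it.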