Controllable Generation Techniques for Privacy-Preserving Synthetic Text

Abstract

Text anonymization is essential for the responsible development and deployment of AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that builds on the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes that guide controllable generation through either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets show that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
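To make the idea of entity-aware control codes concrete, the following is a minimal sketch of how HIPS-style surrogate substitution and an ICL prompt might fit together. It is an illustration under stated assumptions, not the method evaluated in this work: the `Entity` class, the `<LABEL: value>` control-code format, the `Control:`/`Text:` prompt layout, and all function names are hypothetical, and the entity spans are assumed to come from an external de-identification system.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A typed entity mention (hypothetical structure for this sketch)."""
    label: str  # entity type, e.g. "PERSON" or "DATE"
    text: str   # surface form


def hips_replace(text: str, mapping: dict[str, str]) -> str:
    """Swap each detected identifier for a realistic surrogate (HIPS idea).

    `mapping` maps original surface forms to surrogates. A single regex pass
    is used so that a surrogate is never itself re-replaced.
    """
    if not mapping:
        return text
    # Longest keys first so overlapping mentions prefer the longer match.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


def control_code(entities: list[Entity]) -> str:
    """Serialize entities into a control-code string (assumed format)."""
    return " ".join(f"<{e.label}: {e.text}>" for e in entities)


def build_icl_prompt(demos: list[tuple[str, str]], target_code: str) -> str:
    """Assemble an in-context learning prompt from (control code, text) pairs.

    The final block leaves `Text:` open for a language model to complete.
    """
    blocks = [f"Control: {code}\nText: {text}" for code, text in demos]
    blocks.append(f"Control: {target_code}\nText:")
    return "\n\n".join(blocks)


# Example: de-identify one sentence, then use it as an ICL demonstration.
mapping = {"John Smith": "Alex Moore", "12 May 2019": "3 March 2021"}
safe_text = hips_replace("John Smith was admitted on 12 May 2019.", mapping)
demo_code = control_code([Entity("PERSON", "Alex Moore"),
                          Entity("DATE", "3 March 2021")])
target_code = control_code([Entity("PERSON", "Dana Reyes")])
prompt = build_icl_prompt([(demo_code, safe_text)], target_code)
```

In this sketch only surrogate values ever reach the prompt, which is one way to read the claim that the ICL variant inherits its privacy level from the underlying de-identification system. The prefix tuning variant, with its custom masking strategy and loss function, is not covered here.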