Controlled Generation for Private Synthetic Text

Abstract

Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that builds on the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes that guide controllable generation through either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with those of the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets demonstrate that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.
