Controlled Generation of Synthetic Text for Privacy Protection

Abstract

Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privacy-preserving synthetic text generation that builds on the principles of de-identification and the Hiding In Plain Sight (HIPS) theory. Our approach introduces entity-aware control codes that steer generation through either in-context learning (ICL) or prefix tuning. The ICL variant ensures privacy levels consistent with the underlying de-identification system, while the prefix tuning variant incorporates a custom masking strategy and loss function to support scalable, high-quality generation. Experiments on legal and clinical datasets show that our method achieves a strong balance between privacy protection and utility, offering a practical and effective solution for synthetic text generation in sensitive domains.