

Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

February 3, 2026
作者: Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan, Rui Xin, Shuyue Stella Li, Jaehun Jung, David Acuna, Qi Pang, Hanshen Xiao, G. Edward Suh, Sewoong Oh, Yulia Tsvetkov, Pang Wei Koh, Yejin Choi
cs.AI

Abstract

Research involving privacy-sensitive data has long been constrained by data scarcity, in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents, such as OpenClaw and Gemini Agent, are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale, fully synthetic dataset built entirely from scratch: an expansive reservoir of texts with rich and diverse private information, designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers an orders-of-magnitude larger scale without compromising quality, and far greater diversity across document types, including medical histories, legal documents, financial records, calendars, and text messages, with a total of 55.1 million annotated attributes such as ethnicity, date of birth, and workplace. We leverage Privasis to construct a parallel corpus for text sanitization using a pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (≤4B parameters) trained on this dataset outperform state-of-the-art large language models such as GPT-5 and Qwen-3 235B. We plan to release the data, models, and code to accelerate future research on privacy-sensitive domains and agents.
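To make the "decompose texts and apply targeted sanitization" idea concrete, here is a minimal, hypothetical sketch. The `AttributeSpan` schema, the sentence-level `decompose` step, the placeholder-substitution `sanitize` step, and the example record are all illustrative assumptions for exposition; they are not the authors' actual Privasis pipeline, annotation schema, or data.

```python
# Hypothetical sketch: split a document into units, then replace annotated
# private-attribute spans with typed placeholders. Not the Privasis pipeline.
import re
from dataclasses import dataclass


@dataclass
class AttributeSpan:
    """One annotated private attribute inside a document (assumed schema)."""
    attribute: str   # e.g. "date_of_birth", "workplace"
    start: int       # character offset where the span begins (inclusive)
    end: int         # character offset where the span ends (exclusive)


def decompose(text: str) -> list[str]:
    """Split a document into smaller units (here: sentences), so that
    sanitization can be targeted at the units containing private spans."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sanitize(text: str, spans: list[AttributeSpan]) -> str:
    """Replace each annotated span with a typed placeholder, working from the
    end of the string so earlier character offsets remain valid."""
    out = text
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"[{span.attribute.upper()}]" + out[span.end:]
    return out


if __name__ == "__main__":
    # Fabricated example record; names and offsets are illustrative only.
    record = "Maria Chen was born on 1985-03-12. She works at Northwind Clinic."
    spans = [
        AttributeSpan("name", 0, 10),
        AttributeSpan("date_of_birth", 23, 33),
        AttributeSpan("workplace", 48, 64),
    ]
    print(decompose(record))
    print(sanitize(record, spans))
    # -> "[NAME] was born on [DATE_OF_BIRTH]. She works at [WORKPLACE]."
```

In this toy setup, a parallel corpus for training a sanitization model would pair each original unit with its sanitized counterpart; the paper's actual pipeline and attribute taxonomy should be taken from the released data and code rather than this sketch.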