
Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

February 3, 2026
Authors: Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan, Rui Xin, Shuyue Stella Li, Jaehun Jung, David Acuna, Qi Pang, Hanshen Xiao, G. Edward Suh, Sewoong Oh, Yulia Tsvetkov, Pang Wei Koh, Yejin Choi
cs.AI

Abstract

Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents, such as OpenClaw and Gemini Agent, are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale, fully synthetic dataset built entirely from scratch: an expansive reservoir of texts with rich and diverse private information, designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale without sacrificing quality, and far greater diversity across document types, including medical histories, legal documents, financial records, calendars, and text messages, with a total of 55.1 million annotated attributes such as ethnicity, date of birth, and workplace. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (≤4B parameters) trained on this dataset outperform state-of-the-art large language models such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.
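
The abstract only gestures at the corpus-construction pipeline ("decomposes texts and applies targeted sanitization") without detailing it. Below is a minimal, hypothetical Python sketch of that decompose-then-sanitize pattern; the attribute schema, regex-based decomposition, and placeholder-style redaction are assumptions made purely for illustration and are not taken from the paper.

# Illustrative sketch only: the paper's actual pipeline is not described in this
# abstract. The attribute names, the regex decomposition, and the placeholder
# redaction below are assumptions used to show the general idea of decomposing
# a record into attribute spans and sanitizing only those spans.
import re
from dataclasses import dataclass

@dataclass
class AttributeSpan:
    attribute: str   # e.g. "date_of_birth", "workplace"
    start: int       # character offset where the span begins
    end: int         # character offset where the span ends (exclusive)

def decompose(text: str) -> list[AttributeSpan]:
    # Toy decomposition: flag date-of-birth-like patterns. A real system would
    # label many attribute types (ethnicity, workplace, etc.), likely with a
    # learned model rather than a regex.
    spans = []
    for m in re.finditer(r"\b\d{4}-\d{2}-\d{2}\b", text):
        spans.append(AttributeSpan("date_of_birth", m.start(), m.end()))
    return spans

def sanitize(text: str, spans: list[AttributeSpan]) -> str:
    # Targeted sanitization: rewrite only the flagged spans, leaving the rest
    # of the document untouched. Spans are applied right-to-left so earlier
    # offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.attribute.upper()}]" + text[span.end:]
    return text

record = "Patient born 1987-03-14 reported chest pain."
print(sanitize(record, decompose(record)))
# -> "Patient born [DATE_OF_BIRTH] reported chest pain."

Pairing each raw record with its sanitized counterpart in this way is one plausible reading of how a parallel corpus for training compact sanitization models could be assembled.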