Struct-Bench: A Benchmark for Differentially Private Structured Text Generation
September 12, 2025
Authors: Shuaiqi Wang, Vikas Raunak, Arturs Backurs, Victor Reis, Pei Zhou, Sihao Chen, Longqi Yang, Zinan Lin, Sergey Yekhanin, Giulia Fanti
cs.AI
Abstract
Differentially private (DP) synthetic data generation is a promising
technique for utilizing private datasets that otherwise cannot be exposed for
model training or other analytics. While much research literature has focused
on generating private unstructured text and image data, in enterprise settings,
structured data (e.g., tabular) is more common, often including natural
language fields or components. Existing synthetic data evaluation techniques
(e.g., Fréchet Inception Distance, FID) struggle to capture the structural
properties and correlations of
such datasets. In this work, we propose Struct-Bench, a framework and benchmark
for evaluating synthetic datasets derived from structured datasets that contain
natural language data. The Struct-Bench framework requires users to provide a
representation of their dataset structure as a Context-Free Grammar (CFG). Our
benchmark comprises 5 real-world and 2 synthetically generated datasets, each
annotated with CFGs. We show that these datasets pose a significant
challenge even for state-of-the-art DP synthetic data generation methods.
Struct-Bench also includes reference implementations of different metrics and a
leaderboard, thereby providing researchers with a standardized evaluation platform
to benchmark and investigate privacy-preserving synthetic data generation
methods. We also present a case study showing how to use Struct-Bench
to improve the synthetic data quality of Private Evolution (PE) on structured
data. The benchmark and the leaderboard are publicly available at
https://struct-bench.github.io.
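
To make the CFG requirement concrete, the following is a minimal sketch of how a dataset's structure could be written as a grammar and used to check whether synthetic samples conform to it. The toy dialogue grammar, the names GRAMMAR, TERMINALS, ends, conforms, and validity_rate, and the "fraction of conforming samples" score are all illustrative assumptions; they are not taken from Struct-Bench's grammars, metrics, or reference implementation.

```python
import re

# Hypothetical toy grammar (an assumption, not a Struct-Bench grammar):
#   record -> turn record | turn
#   turn   -> USER TEXT NL BOT TEXT NL
GRAMMAR = {
    "record": [["turn", "record"], ["turn"]],
    "turn": [["USER", "TEXT", "NL", "BOT", "TEXT", "NL"]],
}

# Terminals are matched with regular expressions; TEXT stands in for a
# free-form natural language field.
TERMINALS = {
    "USER": re.compile(r"\[USER\] "),
    "BOT": re.compile(r"\[BOT\] "),
    "TEXT": re.compile(r"[^\n\[\]]+"),
    "NL": re.compile(r"\n"),
}


def ends(symbol, text, pos):
    """Yield every position where `symbol` can finish when started at `pos`."""
    if symbol in TERMINALS:
        match = TERMINALS[symbol].match(text, pos)
        if match:
            yield match.end()
        return
    for production in GRAMMAR[symbol]:
        positions = {pos}
        for part in production:
            # Advance every candidate position through the next symbol.
            positions = {e for p in positions for e in ends(part, text, p)}
            if not positions:
                break
        yield from positions


def conforms(text, start="record"):
    """Return True if the whole string can be derived from `start`."""
    return len(text) in set(ends(start, text, 0))


def validity_rate(samples):
    """Fraction of samples that conform to the grammar (illustrative score)."""
    return sum(conforms(s) for s in samples) / len(samples)


good = "[USER] hi\n[BOT] hello\n"
bad = "[USER] hi\n[USER] again\n"  # second speaker tag violates the grammar
print(conforms(good), conforms(bad), validity_rate([good, bad]))
```

The grammar is right-recursive so that the simple backtracking matcher terminates; a production setting would more likely rely on a dedicated parsing library.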