Struct-Bench: A Benchmark for Differentially Private Structured Text Generation
September 12, 2025
Authors: Shuaiqi Wang, Vikas Raunak, Arturs Backurs, Victor Reis, Pei Zhou, Sihao Chen, Longqi Yang, Zinan Lin, Sergey Yekhanin, Giulia Fanti
cs.AI
Abstract
Differentially private (DP) synthetic data generation is a promising
technique for utilizing private datasets that otherwise cannot be exposed for
model training or other analytics. While much research literature has focused
on generating private unstructured text and image data, in enterprise settings,
structured data (e.g., tabular) is more common, often including natural
language fields or components. Existing synthetic data evaluation techniques
(e.g., FID) struggle to capture the structural properties and correlations of
such datasets. In this work, we propose Struct-Bench, a framework and benchmark
for evaluating synthetic datasets derived from structured datasets that contain
natural language data. The Struct-Bench framework requires users to provide a
representation of their dataset structure as a Context-Free Grammar (CFG). Our
benchmark comprises 5 real-world and 2 synthetically generated datasets, each
annotated with CFGs. We show that these datasets pose a significant
challenge even for state-of-the-art DP synthetic data generation methods.
Struct-Bench also includes reference implementations of different metrics and a
leaderboard, thereby providing researchers with a standardized evaluation platform
to benchmark and investigate privacy-preserving synthetic data generation
methods. We also present a case study showing how to use Struct-Bench
to improve the synthetic data quality of Private Evolution (PE) on structured
data. The benchmark and the leaderboard are publicly available at
https://struct-bench.github.io.
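
To make the idea of describing a dataset's structure with a CFG more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy record format (alternating `[USER]` and `[BOT]` turns with free-text fields), the grammar encoding, and the names `GRAMMAR`, `TERMINALS`, `conforms`, and `validity_rate` are hypothetical and are not taken from Struct-Bench or its reference implementations. The sketch only shows how a grammar could be used to check what fraction of synthetic samples respect a declared structure.

```python
import re

# Hypothetical grammar for a toy record format (not the Struct-Bench format):
#   record -> turn record | turn
#   turn   -> USER_TAG TEXT BOT_TAG TEXT
GRAMMAR = {
    "record": [["turn", "record"], ["turn"]],
    "turn": [["USER_TAG", "TEXT", "BOT_TAG", "TEXT"]],
}

# Terminals are regular expressions; TEXT stands in for a natural language field.
TERMINALS = {
    "USER_TAG": re.compile(r"\[USER\]"),
    "BOT_TAG": re.compile(r"\[BOT\]"),
    "TEXT": re.compile(r"[^\[\]]+"),
}


def match(symbol, text, pos):
    """Yield each position where `symbol` can end when parsing starts at `pos`."""
    if symbol in TERMINALS:
        m = TERMINALS[symbol].match(text, pos)
        if m:
            yield m.end()
        return
    for production in GRAMMAR[symbol]:
        ends = [pos]
        for part in production:
            # Advance every candidate position through the next symbol.
            ends = [e2 for e in ends for e2 in match(part, text, e)]
            if not ends:
                break
        yield from ends


def conforms(text, start="record"):
    """Return True if the whole string can be derived from the start symbol."""
    return any(end == len(text) for end in match(start, text, 0))


def validity_rate(samples):
    """Fraction of samples that parse under the grammar."""
    return sum(conforms(s) for s in samples) / len(samples)


if __name__ == "__main__":
    synthetic = [
        "[USER] hi [BOT] hello",
        "[USER] hi [USER] again",  # missing the [BOT] turn
    ]
    print(validity_rate(synthetic))
```

A backtracking matcher like this is adequate only for small grammars without left recursion; a real evaluation would more plausibly rely on a dedicated parsing library and on metrics beyond structural validity, such as the quality of the natural language fields themselves.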