Struct-Bench: A Benchmark for Differentially Private Structured Text Generation

Abstract

Differentially private (DP) synthetic data generation is a promising technique for making use of private datasets that otherwise cannot be exposed for model training or other analytics. While much of the research literature has focused on generating private unstructured text and image data, structured data (e.g., tabular) is more common in enterprise settings and often includes natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic datasets derived from structured datasets that contain natural language data. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises 5 real-world and 2 synthetically generated datasets, each annotated with a CFG. We show that these datasets pose a significant challenge even for state-of-the-art DP synthetic data generation methods. Struct-Bench also includes reference implementations of different metrics and a leaderboard, giving researchers a standardized evaluation platform to benchmark and investigate privacy-preserving synthetic data generation methods. We also present a case study showing how to use Struct-Bench to improve the synthetic data quality of Private Evolution (PE) on structured data. The benchmark and the leaderboard are publicly available at https://struct-bench.github.io.
