Struct-Bench: A Benchmark for Differentially Private Structured Text Generation

Abstract

Differentially private (DP) synthetic data generation is a promising technique for making use of private datasets that otherwise cannot be exposed for model training or other analytics. While much of the research literature has focused on privately generating unstructured text and image data, structured data (e.g., tabular data) is more common in enterprise settings and often includes natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic datasets derived from structured datasets that contain natural language data. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises 5 real-world and 2 synthetically generated datasets, each annotated with a CFG. We show that these datasets pose a substantial challenge even for state-of-the-art DP synthetic data generation methods. Struct-Bench also includes reference implementations of different metrics and a leaderboard, providing researchers with a standardized evaluation platform to benchmark and investigate privacy-preserving synthetic data generation methods. We also present a case study showing how to use Struct-Bench to improve the synthetic data quality of Private Evolution (PE) on structured data. The benchmark and the leaderboard are publicly available at https://struct-bench.github.io.
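
To make the CFG requirement concrete, the following is a minimal sketch, assuming a hypothetical toy grammar for two-speaker dialogue records. The grammar, the function names (`parse_turn`, `is_valid_record`, `validity_rate`), and the validity-rate computation are illustrative assumptions. They are not Struct-Bench's reference implementation, and they are not one of its actual metrics or dataset grammars.

```python
import re

# Hypothetical toy grammar (an assumption for illustration, not from Struct-Bench):
#   record -> turn record | turn
#   turn   -> "[USER] " TEXT "\n" "[BOT] " TEXT "\n"
#   TEXT   -> one or more characters other than newline
TEXT = re.compile(r"[^\n]+")


def parse_turn(s, pos):
    """Return the position just after one turn starting at pos, or None."""
    for tag in ("[USER] ", "[BOT] "):
        if not s.startswith(tag, pos):
            return None
        pos += len(tag)
        m = TEXT.match(s, pos)  # free-text field of the current speaker
        if m is None:
            return None
        pos = m.end()
        if not s.startswith("\n", pos):
            return None
        pos += 1
    return pos


def is_valid_record(s):
    """True if s derives from `record`, i.e. one or more complete turns."""
    pos = parse_turn(s, 0)
    if pos is None:
        return False
    while pos < len(s):
        pos = parse_turn(s, pos)
        if pos is None:
            return False
    return True


def validity_rate(samples):
    """Fraction of samples that parse under the toy grammar."""
    if not samples:
        return 0.0
    return sum(is_valid_record(x) for x in samples) / len(samples)
```

Under these assumptions, a synthetic sample that drops the `[BOT]` half of a turn fails to parse, so a structural check of this kind can flag samples whose free text reads fluently but whose record structure is broken.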