Struct-Bench: A Benchmark for Differentially Private Structured Text Generation

Abstract

Differentially private (DP) synthetic data generation is a promising technique for making use of sensitive datasets that otherwise cannot be exposed for model training or other analytics. While much of the research literature has focused on privacy-preserving generation of unstructured text and image data, structured data (e.g., tabular) is more common in enterprise settings, and it often includes natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic datasets derived from structured datasets that contain natural language data. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises 5 real-world and 2 synthetically generated datasets, each annotated with a CFG. We show that these datasets pose a significant challenge even for state-of-the-art DP synthetic data generation methods. Struct-Bench also includes reference implementations of different metrics and a leaderboard, giving researchers a standardized evaluation platform for benchmarking and investigating privacy-preserving synthetic data generation methods. We also present a case study showing how to use Struct-Bench to improve the synthetic data quality of Private Evolution (PE) on structured data. The benchmark and the leaderboard are publicly available at https://struct-bench.github.io.
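
To make the idea of describing dataset structure with a CFG more concrete, the following is a minimal sketch, not Struct-Bench's actual grammar format or API. It assumes a hypothetical two-role conversation dataset whose records alternate `[USER]` and `[BOT]` turns, encodes that structure as a small grammar, and checks which samples can be derived from it. The grammar, the tag names, and the helper functions `tokenize`, `derive`, `conforms`, and `validity_rate` are all illustrative assumptions.

```python
import re

# Hypothetical grammar for a conversation record (an assumption for
# illustration, not taken from Struct-Bench). Dictionary keys are
# nonterminals; every other symbol is a terminal token type.
GRAMMAR = {
    "record": [["turn", "record"], ["turn"]],
    "turn": [["USER_TAG", "TEXT", "BOT_TAG", "TEXT"]],
}

# Role tags, runs of free text, and stray brackets (treated as invalid).
TOKEN_RE = re.compile(r"\[USER\]|\[BOT\]|[^\[\]]+|[\[\]]")


def tokenize(sample):
    """Map a raw string to a list of terminal token types."""
    tokens = []
    for match in TOKEN_RE.finditer(sample):
        piece = match.group().strip()
        if piece == "[USER]":
            tokens.append("USER_TAG")
        elif piece == "[BOT]":
            tokens.append("BOT_TAG")
        elif piece in ("[", "]"):
            tokens.append("INVALID")
        elif piece:  # non-empty free text; whitespace-only runs are skipped
            tokens.append("TEXT")
    return tokens


def derive(symbol, tokens, pos):
    """Yield each end position reachable by deriving `symbol` from tokens[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must match the next token exactly
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        ends = {pos}
        for part in production:
            ends = {nxt for end in ends for nxt in derive(part, tokens, end)}
        yield from ends


def conforms(sample):
    """Return True if the whole sample can be derived from `record`."""
    tokens = tokenize(sample)
    return len(tokens) in set(derive("record", tokens, 0))


def validity_rate(samples):
    """Fraction of samples that conform to the grammar."""
    return sum(conforms(s) for s in samples) / len(samples)


if __name__ == "__main__":
    synthetic = [
        "[USER] hi [BOT] hello",
        "[USER] hi [USER] anyone there?",
    ]
    print(validity_rate(synthetic))
```

A check of this kind only captures whether a synthetic sample is structurally well formed; it says nothing about the quality of the natural language inside each field, which would need to be measured separately.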