Large Language Models for Data Synthesis

Abstract

Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.
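
The iterative loop described above (compare summary statistics, turn the gap into a proposal distribution, sample from that proposal without rejection, repeat) can be illustrated with a toy sketch. This is not the LLMSynthor implementation. The LLM is replaced by a deterministic stand-in, `stub_proposer`, and the copula modeling is left out entirely. The summary statistic (a joint frequency table), the discrepancy measure (total variation distance), and every function name, parameter, and data value below are assumptions made for illustration only.

```python
import random
from collections import Counter


def joint_frequencies(rows):
    """Summary statistic: relative frequency of each (age_group, region) cell."""
    counts = Counter(rows)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def total_variation(p, q):
    """Discrepancy between two frequency tables (0 = identical, 1 = disjoint)."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


def stub_proposer(real_stats, synth_counts, batch):
    """Hypothetical stand-in for the LLM proposal step (assumption, not the paper's method).

    Converts distributional feedback into a proposal distribution that puts
    mass only on cells still under-represented after the next batch.
    """
    target_n = sum(synth_counts.values()) + batch
    deficits = {
        cell: max(0.0, p * target_n - synth_counts.get(cell, 0))
        for cell, p in real_stats.items()
    }
    mass = sum(deficits.values())  # at least `batch`, so never zero
    return {cell: d / mass for cell, d in deficits.items() if d > 0}


def synthesize(real_rows, batch=200, max_iters=50, tol=0.02, seed=0):
    """Grow a synthetic sample until its summary statistics match the real ones."""
    rng = random.Random(seed)
    real_stats = joint_frequencies(real_rows)
    synth_rows = []
    for _ in range(max_iters):
        if synth_rows and total_variation(real_stats, joint_frequencies(synth_rows)) <= tol:
            break
        proposal = stub_proposer(real_stats, Counter(synth_rows), batch)
        cells = sorted(proposal)
        weights = [proposal[c] for c in cells]
        # Every draw from the proposal is kept; there is no rejection step.
        synth_rows.extend(rng.choices(cells, weights=weights, k=batch))
    return synth_rows


# Illustrative data only: 100 made-up rows over two categorical variables.
real_rows = (
    [("18-34", "north")] * 30
    + [("18-34", "south")] * 20
    + [("35+", "north")] * 10
    + [("35+", "south")] * 40
)
synth_rows = synthesize(real_rows)
gap = total_variation(joint_frequencies(real_rows), joint_frequencies(synth_rows))
print(f"{len(synth_rows)} synthetic rows, total variation distance {gap:.4f}")
```

In this sketch only aggregate statistics of the real data reach the proposer, never individual rows, which mirrors the abstract's emphasis on alignment in summary-statistics space. In the framework itself, an LLM would presumably take the place of `stub_proposer`.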
