Applications of Large Language Models in Data Synthesis

Abstract

Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manually designed structure, and they struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and unable to ensure statistical alignment. To address these limitations, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling, which generates grounded proposal distributions that improve sampling efficiency without requiring rejection sampling. By minimizing discrepancies in the space of summary statistics, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.
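
As a rough illustration of the feedback loop described in the abstract, the sketch below iterates between measuring a discrepancy on summary statistics and drawing new records from a proposal distribution, with no rejection step. It is a toy under stated assumptions, not the LLMSynthor implementation: the LLM is replaced by a hypothetical deterministic stub (`stub_proposal`) that turns per-cell deficits into sampling weights, the summary statistic is a single joint count table over two categorical fields, and the discrepancy is total variation distance. All function names, field values, and parameters are invented for the example, and the copula aspect of the method is not modeled.

```python
import random
from collections import Counter


def joint_counts(records):
    """Summary statistic: counts over joint cells, e.g. (age_group, car_status)."""
    return Counter(records)


def tv_distance(real, synth):
    """Total variation distance between two normalized count tables."""
    n_real = sum(real.values())
    n_synth = sum(synth.values())
    if n_synth == 0:
        return 1.0  # convention for an empty synthetic set
    cells = set(real) | set(synth)
    return 0.5 * sum(abs(real[c] / n_real - synth[c] / n_synth) for c in cells)


def stub_proposal(real, synth, target_size):
    """Hypothetical stand-in for the LLM proposal step (an assumption).

    Converts the per-cell shortfall of the synthetic data, relative to the
    real proportions scaled to target_size, into a proposal distribution.
    """
    n_real = sum(real.values())
    deficits = {
        c: max(real[c] / n_real * target_size - synth[c], 0.0) for c in real
    }
    total = sum(deficits.values())
    if total == 0:
        return {}
    return {c: d / total for c, d in deficits.items() if d > 0}


def synthesize(real_records, target_size, batch_size=50, max_iters=100, seed=0):
    """Iteratively sample from the proposal and track the discrepancy."""
    rng = random.Random(seed)
    real = joint_counts(real_records)
    synth = Counter()
    history = [tv_distance(real, synth)]
    for _ in range(max_iters):
        remaining = target_size - sum(synth.values())
        if remaining <= 0:
            break
        proposal = stub_proposal(real, synth, target_size)
        if not proposal:
            break
        cells = list(proposal)
        weights = [proposal[c] for c in cells]
        # Every draw is kept: sampling is direct, with no rejection step.
        draws = rng.choices(cells, weights=weights, k=min(batch_size, remaining))
        synth.update(draws)
        history.append(tv_distance(real, synth))
    return synth, history


if __name__ == "__main__":
    # Invented toy data: (age_group, car_status) pairs.
    real_records = (
        [("18-34", "car")] * 300 + [("18-34", "no_car")] * 200
        + [("35-64", "car")] * 350 + [("35-64", "no_car")] * 50
        + [("65+", "car")] * 40 + [("65+", "no_car")] * 60
    )
    synth, history = synthesize(real_records, target_size=500)
    print("records generated:", sum(synth.values()))
    print("TV distance, first and last:", history[0], round(history[-1], 4))
```

Because the proposal is recomputed from the current deficits after every batch, sampling noise from earlier batches tends to be corrected by later ones. In the framework described above, that role would be played by the LLM conditioned on distributional feedback rather than by a hand-written rule.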