Large Language Models for Data Synthesis

Abstract

Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design, and they struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and unable to ensure statistical alignment. To address these limitations, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.
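
The iterative loop described in the abstract (compare summary statistics, obtain a proposal distribution, sample from it without rejection, repeat) can be illustrated with a toy sketch. This is not the authors' implementation. The LLM is replaced by a hypothetical `stub_proposer` that simply weights each cell by how under-represented it currently is, the summary statistics are plain joint cell frequencies, and the discrepancy is total variation distance. All function names, the batch size, and the example records are assumptions made for illustration only.

```python
import random
from collections import Counter


def summarize(records):
    """Summary statistics: joint cell frequencies, normalised to proportions."""
    counts = Counter(records)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()} if total else {}


def discrepancy(real_stats, synth_stats):
    """Total variation distance between two sets of cell proportions."""
    cells = set(real_stats) | set(synth_stats)
    return 0.5 * sum(
        abs(real_stats.get(c, 0.0) - synth_stats.get(c, 0.0)) for c in cells
    )


def stub_proposer(real_stats, synth_counts, target_size):
    """Hypothetical stand-in for the LLM proposal step.

    Returns a distribution over cells weighted by each cell's deficit,
    i.e. how many more records it needs to reach its target count.
    """
    deficits = {
        c: max(p * target_size - synth_counts.get(c, 0), 0.0)
        for c, p in real_stats.items()
    }
    total = sum(deficits.values())
    return {c: d / total for c, d in deficits.items()} if total > 0 else {}


def synthesize(real_records, target_size, batch_size=50, seed=0):
    """Grow a synthetic dataset batch by batch, guided by distributional feedback."""
    rng = random.Random(seed)
    real_stats = summarize(real_records)
    synth, history = [], []
    while len(synth) < target_size:
        proposal = stub_proposer(real_stats, Counter(synth), target_size)
        if not proposal:
            break  # no cell is under-represented
        cells = sorted(proposal)
        weights = [proposal[c] for c in cells]
        k = min(batch_size, target_size - len(synth))
        # Every draw from the proposal is kept: no rejection step.
        synth.extend(rng.choices(cells, weights=weights, k=k))
        history.append(discrepancy(real_stats, summarize(synth)))
    return synth, history


if __name__ == "__main__":
    # Made-up records: (age group, transport mode).
    real = (
        [("young", "bike")] * 30 + [("young", "car")] * 10
        + [("old", "car")] * 50 + [("old", "bike")] * 10
    )
    synth, history = synthesize(real, target_size=2000)
    print(f"records: {len(synth)}, final discrepancy: {history[-1]:.4f}")
```

In this sketch only the summary statistics of the real data are passed to the proposer, never the raw records, which mirrors the feedback-in-statistics-space idea in the abstract. A real system would need a proposer that can generalise beyond observed cells, which is the role the abstract assigns to the LLM.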