Large Language Models for Data Synthesis
May 20, 2025
Authors: Yihong Tang, Menglin Kong, Lijun Sun
cs.AI
Abstract
Generating synthetic data that faithfully captures the statistical structure
of real-world distributions is a fundamental challenge in data modeling.
Classical approaches often depend on strong parametric assumptions or manual
structural design and struggle in high-dimensional or heterogeneous domains.
Recent progress in Large Language Models (LLMs) reveals their potential as
flexible, high-dimensional priors over real-world distributions. However, when
applied to data synthesis, standard LLM-based sampling is inefficient,
constrained by fixed context limits, and fails to ensure statistical alignment.
To address these limitations, we introduce LLMSynthor, a general framework for data synthesis
that transforms LLMs into structure-aware simulators guided by distributional
feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for
modeling high-order dependencies and introduces LLM Proposal Sampling to
generate grounded proposal distributions that improve sampling efficiency
without requiring rejection. By minimizing discrepancies in the summary
statistics space, the iterative synthesis loop aligns real and synthetic data
while gradually uncovering and refining the latent generative structure. We
evaluate LLMSynthor in both controlled and real-world settings using
heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce,
population, and mobility) that encompass both structured and unstructured
formats. The synthetic data produced by LLMSynthor shows high statistical
fidelity, practical utility, and cross-data adaptability, positioning it as a
valuable tool across economics, social science, urban studies, and beyond.
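
The iterative loop described in the abstract (compare summary statistics of real and synthetic data, then draw new records from a proposal that targets the gaps) can be illustrated with a toy sketch. This is not the authors' implementation: the function `propose` is a hypothetical stand-in for the LLM proposer, the summary statistics are assumed to be plain joint frequencies over two categorical variables, the discrepancy is assumed to be total variation distance, and every name and parameter below is invented for illustration.

```python
import random
from collections import Counter


def summary(records):
    """Summary statistics (assumed here): empirical joint frequencies of the records."""
    counts = Counter(records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def discrepancy(real_stats, synth_stats):
    """Total variation distance between two frequency tables."""
    keys = set(real_stats) | set(synth_stats)
    return 0.5 * sum(
        abs(real_stats.get(k, 0.0) - synth_stats.get(k, 0.0)) for k in keys
    )


def propose(real_stats, synth_stats):
    """Hypothetical stand-in for the LLM proposer.

    Places probability mass on cells that the synthetic data under-represents.
    A real system would prompt an LLM with the discrepancy instead.
    """
    gaps = {k: max(p - synth_stats.get(k, 0.0), 0.0) for k, p in real_stats.items()}
    total = sum(gaps.values())
    if total == 0.0:
        # No gap left (or nothing generated yet): fall back to the real frequencies.
        return dict(real_stats)
    return {k: g / total for k, g in gaps.items()}


def synthesize(real_records, rounds=20, batch=50, seed=0):
    """Grow a synthetic dataset round by round, guided by distributional feedback."""
    rng = random.Random(seed)
    real_stats = summary(real_records)
    synthetic, history = [], []
    for _ in range(rounds):
        proposal = propose(real_stats, summary(synthetic))
        cells = list(proposal)
        weights = [proposal[c] for c in cells]
        # Every proposed record is kept; there is no rejection step.
        synthetic.extend(rng.choices(cells, weights=weights, k=batch))
        history.append(discrepancy(real_stats, summary(synthetic)))
    return synthetic, history


# Toy "real" data: (age group, travel mode) pairs with a strong dependency.
real = (
    [("young", "bike")] * 30
    + [("young", "car")] * 10
    + [("old", "bike")] * 5
    + [("old", "car")] * 55
)
synthetic, history = synthesize(real)
print(f"discrepancy after first round: {history[0]:.3f}")
print(f"discrepancy after last round:  {history[-1]:.3f}")
```

The sketch only needs the real data's summary statistics, not the raw records, inside the loop, which mirrors the abstract's point about aligning data in summary-statistics space. Everything beyond that (the choice of statistics, the distance, the batch size) is an assumption made to keep the example small.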