Large Language Models for Data Synthesis

Abstract

Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.
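The iterative loop described above (compare summary statistics, obtain a proposal distribution, sample from it without rejection, repeat) can be illustrated with a small Python sketch. This is not the authors' implementation, only a toy under stated assumptions: the LLM is replaced by a hypothetical deterministic stand-in (`mock_llm_proposal`), the summary statistic is a joint frequency table over two categorical attributes, and the discrepancy is total variation distance. All function names, the example attributes, and these choices are assumptions made for illustration; the abstract does not specify them.

```python
import random
from collections import Counter


def joint_freq(rows):
    """Summary statistic (assumed): relative frequency of each attribute tuple."""
    counts = Counter(rows)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()} if total else {}


def discrepancy(real_stats, synth_stats):
    """Total variation distance between two frequency tables (assumed metric)."""
    cells = set(real_stats) | set(synth_stats)
    return 0.5 * sum(
        abs(real_stats.get(c, 0.0) - synth_stats.get(c, 0.0)) for c in cells
    )


def mock_llm_proposal(real_stats, synth_rows, batch):
    """Hypothetical stand-in for the LLM proposal step.

    Returns a distribution over cells weighted by how many records each cell
    still lacks once the next batch is added. A real system would prompt an
    LLM with the statistical gaps instead of computing this in closed form.
    """
    counts = Counter(synth_rows)
    target_total = len(synth_rows) + batch
    deficits = {
        cell: max(p * target_total - counts.get(cell, 0), 0.0)
        for cell, p in real_stats.items()
    }
    total = sum(deficits.values())  # at least `batch`, so never zero
    return {cell: d / total for cell, d in deficits.items() if d > 0}


def synthesize(real_rows, n_rounds=20, batch=50, seed=0):
    """Iteratively grow a synthetic dataset guided by distributional feedback."""
    rng = random.Random(seed)
    real_stats = joint_freq(real_rows)
    synth, history = [], []
    for _ in range(n_rounds):
        proposal = mock_llm_proposal(real_stats, synth, batch)
        cells = list(proposal)
        weights = [proposal[c] for c in cells]
        # Every draw is kept: sampling from the proposal needs no rejection step.
        synth.extend(rng.choices(cells, weights=weights, k=batch))
        history.append(discrepancy(real_stats, joint_freq(synth)))
    return synth, history


if __name__ == "__main__":
    # Toy "real" data: (age_group, region) pairs with a known joint distribution.
    real_rows = (
        [("18-34", "north")] * 40
        + [("18-34", "south")] * 30
        + [("35+", "north")] * 20
        + [("35+", "south")] * 10
    )
    synth, history = synthesize(real_rows)
    print(f"records: {len(synth)}, final discrepancy: {history[-1]:.4f}")
```

In this toy, the feedback only ever sees aggregate frequencies of the real data, never individual records, which loosely mirrors why a summary-statistic formulation is attractive in privacy-sensitive settings. Whether the actual framework works this way in detail is not stated in the abstract.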