Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
January 22, 2026
作者: Wei Zhou, Jun Zhou, Haoyu Wang, Zhenghao Li, Qikang He, Shaokun Han, Guoliang Li, Xuanhe Zhou, Yeye He, Chunwei Liu, Zirui Tang, Bin Wang, Shen Tang, Kai Zuo, Yuyu Luo, Zhenzhe Zheng, Conghui He, Jingren Zhou, Fan Wu
cs.AI
Abstract
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them; it is essential for a wide range of data-centric applications. Driven by (i) rising demand for application-ready data (e.g., for analytics, visualization, and decision-making), (ii) increasingly powerful LLM techniques, and (iii) the emergence of infrastructure that facilitates flexible agent construction (e.g., Databricks Unity Catalog), LLM-enhanced methods are rapidly becoming a transformative and potentially dominant paradigm for data preparation.
Drawing on hundreds of recent publications, this paper presents a systematic review of this evolving landscape, focusing on the use of LLM techniques to prepare data for diverse downstream tasks. First, we characterize the fundamental paradigm shift from rule-based, model-specific pipelines to prompt-driven, context-aware, and agentic preparation workflows. Next, we introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning (e.g., standardization, error processing, imputation), data integration (e.g., entity matching, schema matching), and data enrichment (e.g., data annotation, profiling). For each task, we survey representative techniques and highlight their respective strengths (e.g., improved generalization, semantic understanding) and limitations (e.g., the prohibitive cost of scaling LLMs, persistent hallucinations even in advanced agents, and the mismatch between advanced methods and weak evaluation practices). Moreover, we analyze commonly used datasets and evaluation metrics, which form the empirical component of the survey. Finally, we discuss open research challenges and outline a forward-looking roadmap that emphasizes scalable LLM-data systems, principled designs for reliable agentic workflows, and robust evaluation protocols.
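The paradigm shift described above can be made concrete with a toy sketch. The snippet below illustrates a prompt-driven cleaning step (missing-value imputation) in the general style the survey covers; it is not taken from any surveyed system, and `call_llm`, `impute_missing`, the prompt wording, and the toy table are hypothetical placeholders. In practice, `call_llm` would be replaced by a real LLM client.

```python
"""Minimal, illustrative sketch of prompt-driven missing-value imputation.

Assumptions: `call_llm` stands in for any LLM provider's API; the prompt
format and the toy table are invented for this example only.
"""

import json


def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned JSON answer so the
    # sketch runs offline.
    return json.dumps({"country": "France"})


def impute_missing(row: dict, column: str, context_rows: list[dict]) -> dict:
    """Ask the LLM to fill one missing cell, given a few context rows."""
    prompt = (
        "You are a data-cleaning assistant.\n"
        f"Context rows: {json.dumps(context_rows)}\n"
        f"Incomplete row: {json.dumps(row)}\n"
        f"Return JSON with the most likely value for the missing column '{column}'."
    )
    answer = json.loads(call_llm(prompt))
    filled = dict(row)
    filled[column] = answer.get(column)
    return filled


if __name__ == "__main__":
    context = [
        {"city": "Berlin", "country": "Germany"},
        {"city": "Madrid", "country": "Spain"},
    ]
    incomplete = {"city": "Paris", "country": None}
    print(impute_missing(incomplete, "country", context))
    # -> {'city': 'Paris', 'country': 'France'}
```

Even in this toy form, the sketch reflects the framing of the survey: the cleaning logic lives in the prompt and the supplied context rather than in hand-written rules, which is also where the limitations highlighted above (cost at scale, hallucinated values, weak evaluation) arise.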