Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
January 22, 2026
Authors: Wei Zhou, Jun Zhou, Haoyu Wang, Zhenghao Li, Qikang He, Shaokun Han, Guoliang Li, Xuanhe Zhou, Yeye He, Chunwei Liu, Zirui Tang, Bin Wang, Shen Tang, Kai Zuo, Yuyu Luo, Zhenzhe Zheng, Conghui He, Jingren Zhou, Fan Wu
cs.AI
Abstract
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them, which is essential for a wide range of data-centric applications. Driven by (i) rising demands for application-ready data (e.g., for analytics, visualization, decision-making), (ii) increasingly powerful LLM techniques, and (iii) the emergence of infrastructures that facilitate flexible agent construction (e.g., using Databricks Unity Catalog), LLM-enhanced methods are rapidly becoming a transformative and potentially dominant paradigm for data preparation.
By surveying hundreds of recent publications, this paper presents a systematic review of this evolving landscape, focusing on the use of LLM techniques to prepare data for diverse downstream tasks. First, we characterize the fundamental paradigm shift from rule-based, model-specific pipelines to prompt-driven, context-aware, and agentic preparation workflows. Next, we introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning (e.g., standardization, error processing, imputation), data integration (e.g., entity matching, schema matching), and data enrichment (e.g., data annotation, profiling). For each task, we survey representative techniques and highlight their respective strengths (e.g., improved generalization, semantic understanding) and limitations (e.g., the prohibitive cost of scaling LLMs, persistent hallucinations even in advanced agents, and the mismatch between advanced methods and weak evaluation practices). Moreover, in the empirical part of the survey, we analyze commonly used datasets and evaluation metrics. Finally, we discuss open research challenges and outline a forward-looking roadmap that emphasizes scalable LLM-data systems, principled designs for reliable agentic workflows, and robust evaluation protocols.
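To make the prompt-driven paradigm above concrete, the minimal sketch below shows how an LLM could be prompted for entity matching, one of the data-integration tasks in the taxonomy. This is an illustration under stated assumptions, not a method from the surveyed work: the `call_llm` helper, the prompt wording, and the record fields are hypothetical placeholders, and a real pipeline would wire the helper to a concrete model provider.

```python
# Minimal sketch of prompt-driven entity matching (hypothetical example).
# `call_llm` stands in for any chat-completion API; it is NOT a real library call.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM endpoint and return its text reply."""
    raise NotImplementedError("Wire this to your model provider of choice.")


def match_entities(record_a: dict, record_b: dict) -> bool:
    """Ask the LLM whether two records describe the same real-world entity."""
    prompt = (
        "Do the following two records refer to the same real-world entity?\n"
        f"Record A: {record_a}\n"
        f"Record B: {record_b}\n"
        "Answer with exactly 'yes' or 'no'."
    )
    answer = call_llm(prompt).strip().lower()
    return answer.startswith("yes")


# Example usage with two product listings that differ only in formatting.
a = {"name": "Apple iPhone 13, 128 GB", "brand": "Apple"}
b = {"name": "iPhone13 128GB (Apple)", "brand": "apple"}
# print(match_entities(a, b))  # expected: True, subject to the model's judgment
```

A context-aware workflow of the kind discussed in the survey would typically enrich such a core prompt with schema descriptions or few-shot examples, and batch or filter candidate pairs to manage the scaling cost noted above.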