
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

April 27, 2026
Authors: Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan
cs.AI

Abstract

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.
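
The test, diagnose, and repair loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`BenchmarkItem`, `TrainingData`, `diagnose`, `repair`, `run_cycles`) is hypothetical, and "training" is stubbed out by assuming the model knows exactly the concepts and concept-to-concept links present in its data. The sketch only shows how a failed test might be classified as a concept-level gap or a reasoning-chain break and then patched.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BenchmarkItem:
    """A test derived from the knowledge base (hypothetical structure)."""
    name: str
    chain: tuple[str, ...]  # ordered concepts the answer must connect


@dataclass
class TrainingData:
    """Toy stand-in for a training corpus: what it covers, nothing more."""
    concepts: set[str] = field(default_factory=set)
    links: set[tuple[str, str]] = field(default_factory=set)


def diagnose(item: BenchmarkItem, data: TrainingData) -> list[tuple[str, object]]:
    """Return the deficiencies that would make this test fail.

    Assumption: the 'compiled' model knows exactly what the data covers,
    so a failure can be read directly off the data.
    """
    missing = [c for c in item.chain if c not in data.concepts]
    if missing:
        return [("concept_gap", c) for c in missing]
    steps = zip(item.chain, item.chain[1:])
    return [("chain_break", s) for s in steps if s not in data.links]


def repair(data: TrainingData, deficiencies: list[tuple[str, object]]) -> None:
    """Apply targeted patches instead of adding data indiscriminately."""
    for kind, payload in deficiencies:
        if kind == "concept_gap":
            data.concepts.add(payload)
        else:
            data.links.add(payload)


def run_cycles(items: list[BenchmarkItem], data: TrainingData,
               max_cycles: int = 5) -> list[int]:
    """Run test/repair cycles; return the number of failing tests per cycle."""
    history = []
    for _ in range(max_cycles):
        failures = {i.name: diagnose(i, data) for i in items}
        failures = {name: d for name, d in failures.items() if d}
        history.append(len(failures))
        if not failures:
            break
        for deficiencies in failures.values():
            repair(data, deficiencies)
    return history


items = [BenchmarkItem("kinetics", ("enzyme", "substrate", "rate"))]
data = TrainingData(concepts={"enzyme", "rate"},
                    links={("enzyme", "substrate")})
print(run_cycles(items, data))  # → [1, 1, 0]
```

In this toy run the first cycle fails on a missing concept (`substrate`), the second on a missing reasoning step (`substrate` to `rate`), and the third passes. A real pipeline would replace the set-membership stub with actual fine-tuning and benchmark evaluation.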