Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
April 27, 2026
Authors: Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan
cs.AI
Abstract
Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.
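The train–benchmark–diagnose–patch cycle the abstract describes can be sketched as a toy loop. All names below (`train`, `benchmark`, `repair_cycle`, the dict-based "model" and "knowledge base") are hypothetical stand-ins for illustration, not the paper's actual pipeline or API: training data plays the role of source code, "training" compiles it into a model, benchmarking acts as unit testing, and each failure is traced back to a deficiency in the data and fixed with a targeted patch from the structured knowledge base.

```python
# Toy sketch of the Programming-with-Data repair cycle (hypothetical names).
# "Training data" is a dict of concept -> fact; "training" compiles it into
# a trivial lookup model; "benchmarking" runs unit-test-style probes; each
# failure is traced to the missing/wrong concept and patched from the KB.

def train(data):
    """'Compile' the training data into a trivial lookup model."""
    return dict(data)

def benchmark(model, tests):
    """Run unit-test-style probes; return the concepts the model fails."""
    return [concept for concept, expected in tests.items()
            if model.get(concept) != expected]

def repair_cycle(data, tests, knowledge_base, max_rounds=5):
    """Train, test, trace failures to data gaps, patch, and retrain."""
    data = dict(data)
    for _ in range(max_rounds):
        model = train(data)
        failures = benchmark(model, tests)
        if not failures:
            break
        for concept in failures:           # trace failure -> data deficiency
            if concept in knowledge_base:  # targeted patch, not more data
                data[concept] = knowledge_base[concept]
    return train(data), data

# Toy run: the initial corpus has one wrong fact and one missing concept.
kb = {"boiling_point_water_C": 100, "speed_of_light_km_s": 299792}
data = {"boiling_point_water_C": 90}   # wrong value; second concept absent
tests = dict(kb)                       # benchmark derived from the same KB
model, repaired = repair_cycle(data, tests, kb)
print(benchmark(model, tests))  # → []
```

The point of the sketch is the shared foundation: because the benchmark and the patches are both derived from the same structured knowledge base, a failing probe identifies exactly which piece of training data to repair, rather than prompting indiscriminate data addition.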