Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Abstract

Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that when a structured knowledge representation extracted from the source corpus serves as the shared foundation for both training data and evaluation, the complete data-engineering lifecycle maps onto the software development lifecycle in a precise and operative way: training data becomes source code specifying what the model should learn, model training becomes compilation, benchmarking becomes unit testing, and failure-driven data repair becomes debugging. Under this correspondence, model failures decompose into concept-level gaps and reasoning-chain breaks that can be traced back to specific deficiencies in the data and repaired through targeted patches, with each repair cycle producing consistent improvements across model scales and architectures without degrading general capabilities. We formalize this principle as Programming with Data and instantiate it across sixteen disciplines spanning the natural sciences, engineering, biomedicine, and the social sciences, releasing a structured knowledge base, benchmark suite, and training corpus as open resources. By demonstrating that the relationship between training data and model behaviour is structurally traceable and systematically repairable, this work establishes a principled foundation for the reliable engineering of human expertise into language models.
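The correspondence described above (benchmarking as unit testing, failure-driven data repair as debugging) can be illustrated with a minimal sketch. Everything here is a hypothetical illustration rather than the authors' implementation: the names `BenchmarkItem`, `diagnose`, and `select_patches`, the concept labels, the coverage counts, and the `min_examples` threshold are all assumptions made for this example. It assumes each benchmark item is tagged with the concepts it exercises, so that a failed item can be traced back to concepts that are under-covered in the training data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """One 'unit test': a question tagged with the concepts it exercises (hypothetical schema)."""
    question: str
    concepts: tuple[str, ...]


def diagnose(items, passed):
    """Count failures per concept so each failed item points at candidate data gaps."""
    failures = Counter()
    for item, ok in zip(items, passed):
        if not ok:
            failures.update(item.concepts)
    return failures


def select_patches(failures, coverage, min_examples=2):
    """Return failing concepts whose training-data coverage is below an assumed threshold."""
    return sorted(c for c in failures if coverage.get(c, 0) < min_examples)


# Toy data: three benchmark items, their pass/fail results, and the number of
# training examples per concept (all values invented for illustration).
items = [
    BenchmarkItem("q1", ("enthalpy", "entropy")),
    BenchmarkItem("q2", ("entropy",)),
    BenchmarkItem("q3", ("gibbs_energy",)),
]
passed = [False, False, True]
coverage = {"enthalpy": 5, "entropy": 1, "gibbs_energy": 3}

failures = diagnose(items, passed)
patch_targets = select_patches(failures, coverage)
print(failures)       # per-concept failure counts
print(patch_targets)  # concepts to target with new training data
```

In this toy run, only the concept that both fails on the benchmark and is thinly covered in the training data is selected for patching, which is the sense in which a failure is "traced back" to a data deficiency. A real pipeline would also need to handle the reasoning-chain breaks the abstract mentions, which this sketch does not model.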
