To Code, or Not To Code? Exploring Impact of Code in Pre-training

Abstract

Including code in the pre-training data mixture, even for models not specifically designed for code, has become common practice in LLM pre-training. While there has been anecdotal consensus among practitioners that code data plays a vital role in general LLM performance, only limited work has analyzed the precise impact of code on non-code tasks. In this work, we systematically investigate the impact of code data on general performance. We ask: "What is the impact of code data used in pre-training on a large variety of downstream tasks beyond code generation?" We conduct extensive ablations and evaluate across a broad range of natural language reasoning tasks, world knowledge tasks, code benchmarks, and LLM-as-a-judge win-rates for models with sizes ranging from 470M to 2.8B parameters. Across settings, we find consistent results showing that code is a critical building block for generalization far beyond coding tasks, and that improvements to code quality have an outsized impact across all tasks. In particular, compared to text-only pre-training, the addition of code results in a relative increase of up to 8.2% in natural language (NL) reasoning, 4.2% in world knowledge, a 6.6% improvement in generative win-rates, and a 12x boost in code performance. Our work suggests that investments in code quality and preserving code during pre-training have positive impacts.
