How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Abstract

Recently, there has been growing interest in how to construct better code instruction tuning data. However, we observe that code models trained on these datasets achieve high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After most of the leaked data is cleaned up, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics depending on their construction methods, which provides new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder.
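
The abstract names three selection dimensions (instruction complexity, response quality, instruction diversity) without stating how they are combined. The following is a minimal sketch of one plausible way to combine them, not the authors' implementation. The field names `complexity`, `quality`, and `embedding`, the product score, the cosine-distance threshold `min_distance`, and the function `select_samples` are all assumptions made for illustration.

```python
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def select_samples(samples, budget, min_distance=0.1):
    """Greedily pick up to `budget` samples (hypothetical scheme).

    Each sample is a dict with assumed keys:
      'complexity' - instruction complexity score (higher is harder)
      'quality'    - response quality score (higher is better)
      'embedding'  - vector representing the instruction
    """
    # Rank by a combined score; the product is an illustrative choice.
    ranked = sorted(
        samples, key=lambda s: s["complexity"] * s["quality"], reverse=True
    )
    selected = []
    for sample in ranked:
        if len(selected) >= budget:
            break
        # Diversity filter: skip samples too close to any already chosen one.
        if all(
            cosine_distance(sample["embedding"], kept["embedding"]) >= min_distance
            for kept in selected
        ):
            selected.append(sample)
    return selected


pool = [
    {"id": "a", "complexity": 0.9, "quality": 0.9, "embedding": [1.0, 0.0]},
    {"id": "b", "complexity": 0.8, "quality": 0.9, "embedding": [1.0, 0.01]},
    {"id": "c", "complexity": 0.5, "quality": 0.6, "embedding": [0.0, 1.0]},
    {"id": "d", "complexity": 0.2, "quality": 0.2, "embedding": [-1.0, 0.0]},
]
chosen_ids = [s["id"] for s in select_samples(pool, budget=2)]
```

In this toy pool, sample `b` scores well but is a near-duplicate of `a`, so the diversity filter passes over it in favor of the lower-scoring but distinct `c`.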
