How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
September 5, 2024
作者: Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
cs.AI
Abstract
Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe that code models trained on these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show that XCoder achieves new state-of-the-art performance while using less training data, verifying the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find that existing code datasets have different characteristics depending on their construction methods, providing new insights for future code LLMs. Our models and dataset are released at https://github.com/banksy23/XCoder.
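To make the three-dimensional selection pipeline concrete, the sketch below shows one way scoring and pruning could fit together: rank samples by complexity and quality, then greedily keep only instructions that are not near-duplicates of already-selected ones. This is a hypothetical illustration only; the `Sample` class and the `complexity_score`/`quality_score` heuristics are stand-ins invented for this sketch, and the paper's actual (e.g., model-based) scorers are not reproduced here.

```python
# Minimal, hypothetical sketch of three-dimensional data pruning.
# The scoring heuristics are illustrative placeholders, not the
# scorers used for XCoder.
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    response: str


def complexity_score(sample: Sample) -> float:
    # Placeholder: longer, multi-step instructions score higher.
    return min(len(sample.instruction.split()) / 100.0, 1.0)


def quality_score(sample: Sample) -> float:
    # Placeholder: reward responses that include executable code.
    return 1.0 if "def " in sample.response else 0.5


def select_samples(pool: list[Sample], budget: int) -> list[Sample]:
    """Rank by complexity + quality, then greedily enforce diversity
    by skipping near-duplicate instructions (Jaccard overlap > 0.8)."""
    ranked = sorted(
        pool,
        key=lambda s: complexity_score(s) + quality_score(s),
        reverse=True,
    )
    selected: list[Sample] = []
    kept_word_sets: list[frozenset] = []
    for s in ranked:
        words = frozenset(s.instruction.lower().split())
        if any(len(words & prev) / max(len(words | prev), 1) > 0.8
               for prev in kept_word_sets):
            continue  # too similar to an already-selected instruction
        selected.append(s)
        kept_word_sets.append(words)
        if len(selected) >= budget:
            break
    return selected
```

The greedy Jaccard filter is just one cheap way to operationalize the diversity dimension; embedding-based deduplication would be a natural drop-in replacement.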