How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data
September 5, 2024
Authors: Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
cs.AI
Abstract
Recently, there has been a growing interest in studying how to construct
better code instruction tuning data. However, we observe that code models trained
with these datasets exhibit high performance on HumanEval but perform worse on
other benchmarks such as LiveCodeBench. Upon further investigation, we find
that many datasets suffer from severe data leakage. After cleaning up most of
the leaked data, some well-known high-quality datasets perform poorly. This
discovery reveals a new challenge: identifying which datasets genuinely qualify
as high-quality code instruction data. To address this, we propose an efficient
code data pruning strategy for selecting good samples. Our approach is based on
three dimensions: instruction complexity, response quality, and instruction
diversity. Based on our selected data, we present XCoder, a family of models
finetuned from LLaMA3. Our experiments show that XCoder achieves new
state-of-the-art performance using less training data, which verifies the
effectiveness of our data strategy. Moreover, we perform a comprehensive
analysis of the data composition and find that existing code datasets have different
characteristics according to their construction methods, which provides new
insights for future code LLMs. Our models and dataset are released at
https://github.com/banksy23/XCoder
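
To make the three selection dimensions concrete, here is a minimal sketch of how such a pruning step could look. It is not the authors' implementation: the abstract does not say how complexity, quality, or diversity are measured or combined. The `Sample` fields, the product of the two scores, the cosine-distance threshold `min_distance`, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Sample:
    instruction: str
    response: str
    complexity: float   # hypothetical instruction-complexity score in [0, 1]
    quality: float      # hypothetical response-quality score in [0, 1]
    embedding: tuple    # hypothetical instruction embedding


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def select_samples(pool, budget, min_distance=0.1):
    """Greedy selection: rank by complexity * quality (an assumed combination),
    then keep a sample only if its instruction embedding is at least
    min_distance away from every sample already kept."""
    ranked = sorted(pool, key=lambda s: s.complexity * s.quality, reverse=True)
    selected = []
    for sample in ranked:
        if len(selected) >= budget:
            break
        if all(cosine_distance(sample.embedding, kept.embedding) >= min_distance
               for kept in selected):
            selected.append(sample)
    return selected


# Toy pool: the second item is a near-duplicate of the first.
pool = [
    Sample("Implement an LRU cache", "...", 0.9, 0.9, (1.0, 0.0)),
    Sample("Implement an LRU cache class", "...", 0.8, 0.9, (0.99, 0.05)),
    Sample("Reverse a string", "...", 0.2, 0.9, (0.0, 1.0)),
]
chosen = select_samples(pool, budget=2)
print([s.instruction for s in chosen])
```

In this toy pool the near-duplicate instruction is skipped even though it scores higher than the third item, which is the kind of trade-off a diversity criterion is meant to capture.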