CursorCore: Assist Programming through Aligning Anything

October 9, 2024
Authors: Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
cs.AI

Abstract

Large language models have been successfully applied to programming assistance tasks such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to integrate the different types of information available during programming, including coding history, current code, and user instructions. In this work, we propose a new conversational framework that integrates these information sources, and we collect data to train our models and evaluate their performance. First, to evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), which assesses model performance on programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources such as GitHub and online judge platforms. This pipeline can automatically generate the various types of messages that arise throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, and contributes to the advancement of coding assistants. Code, models, and data are freely available at https://github.com/TechxGenus/CursorCore.
