游標核心:透過對齊任何事物協助編程
CursorCore: Assist Programming through Aligning Anything
October 9, 2024
作者: Hao Jiang, Qi Liu, Rui Li, Shengyu Ye, Shijin Wang
cs.AI
摘要
大型語言模型已成功應用於程式設計輔助任務,如程式碼自動完成、程式碼插入和指導性程式碼編輯。然而,這些應用仍然缺乏自動化,並在程式設計過程中難以有效整合各種類型的資訊,包括編碼歷史、當前程式碼和使用者指示。在這項工作中,我們提出了一個新的對話框架,全面整合這些資訊來源,收集數據來訓練我們的模型並評估其性能。首先,為了全面評估模型與不同類型資訊的對齊程度和其輸出的質量,我們引入了一個新的基準,名為 APEval(Assist Programming Eval),以全面評估模型在程式設計輔助任務中的表現。然後,為了進行數據收集,我們開發了一個數據生成管道 Programming-Instruct,從各種來源(如 GitHub 和線上評判平台)綜合合成訓練數據。該管道可以在整個程式設計過程中自動生成各種類型的訊息。最後,利用這個管道,我們生成了 219K 個樣本,對多個模型進行微調,並開發了 CursorCore 系列。我們展示了 CursorCore 在性能上優於其他相近大小的模型。這個框架統一了內聯聊天和自動編輯等應用,有助於程式設計助手的進步。程式碼、模型和數據可在以下網址免費取得:https://github.com/TechxGenus/CursorCore。
English
Large language models have been successfully applied to programming
assistance tasks, such as code completion, code insertion, and instructional
code editing. However, these applications remain insufficiently automated and
struggle to effectively integrate various types of information during the
programming process, including coding history, current code, and user
instructions. In this work, we propose a new conversational framework that
comprehensively integrates these information sources, collect data to train our
models and evaluate their performance. Firstly, to thoroughly evaluate how well
models align with different types of information and the quality of their
outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to
comprehensively assess the performance of models in programming assistance
tasks. Then, for data collection, we develop a data generation pipeline,
Programming-Instruct, which synthesizes training data from diverse sources,
such as GitHub and online judge platforms. This pipeline can automatically
generate various types of messages throughout the programming process. Finally,
using this pipeline, we generate 219K samples, fine-tune multiple models, and
develop the CursorCore series. We show that CursorCore outperforms other models
of comparable size. This framework unifies applications such as inline chat and
automated editing, contributes to the advancement of coding assistants. Code,
models and data are freely available at
https://github.com/TechxGenus/CursorCore.Summary
AI-Generated Summary