Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
February 22, 2024
Authors: Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, Michele Tufano
cs.AI
Abstract
The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, utilizing LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot Evaluation Harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM-guided IDEs.
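To make the two metric families concrete, here is a minimal sketch of how a harness like this might score a single model response. This is not the paper's implementation: the function names, the choice of Python, and the reliance on pytest are assumptions for illustration, since the actual harness spans multiple languages and IDE scenarios.

```python
# Illustrative sketch only -- not the paper's code. It shows one plausible
# shape for the two metric families named in the abstract: a static check
# (does the generated code parse?) and an execution-based check (what
# fraction of generated test cases pass when actually run?).
import ast
import re
import subprocess
import tempfile
from pathlib import Path


def static_syntax_success(generated_code: str) -> bool:
    """Static metric: the model output must at least be parseable Python."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False


def execution_test_pass_rate(generated_tests: str, timeout_s: int = 60) -> float:
    """Execution-based metric: fraction of generated test cases that pass.

    Assumes a self-contained Python test file and pytest on PATH; both are
    simplifications, since the real harness covers several languages.
    """
    with tempfile.TemporaryDirectory() as tmp:
        test_path = Path(tmp) / "test_generated.py"
        test_path.write_text(generated_tests)
        try:
            result = subprocess.run(
                ["pytest", str(test_path), "-q", "--tb=no"],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Hung test runs count as total failure for this scenario.
            return 0.0
        # pytest prints a summary line such as "3 passed, 1 failed in 0.12s".
        passed = sum(int(n) for n in re.findall(r"(\d+) passed", result.stdout))
        failed = sum(int(n) for n in re.findall(r"(\d+) failed", result.stdout))
        total = passed + failed
        return passed / total if total else 0.0
```

A harness would run checks like these over many examples per scenario (generate, doc, test, fix, workspace) and aggregate pass rates per model; that aggregation is what allows the three LLMs evaluated in the paper to be compared on equal footing within a given IDE and parameter space.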