Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming
February 22, 2024
Authors: Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, Michele Tufano
cs.AI
Abstract
The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, utilizing LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot Evaluation Harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM-guided IDEs.
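To make the two metric families concrete, here is a minimal sketch of how a harness like this might score a single model response. This is not the paper's implementation: the function names, the choice of Python, and the reliance on pytest are assumptions for illustration, since the actual harness spans multiple languages and IDE scenarios.

```python
# Illustrative sketch only -- not the paper's code. It shows one plausible
# shape for the two metric families named in the abstract: a static check
# (does the generated code parse?) and an execution-based check (what
# fraction of generated test cases pass when actually run?).
import ast
import re
import subprocess
import tempfile
from pathlib import Path


def static_syntax_success(generated_code: str) -> bool:
    """Static metric: the model output must at least be parseable Python."""
    try:
        ast.parse(generated_code)
        return True
    except SyntaxError:
        return False


def execution_test_pass_rate(generated_tests: str, timeout_s: int = 60) -> float:
    """Execution-based metric: fraction of generated test cases that pass.

    Assumes a self-contained Python test file and pytest on PATH; both are
    simplifications, since the real harness covers several languages.
    """
    with tempfile.TemporaryDirectory() as tmp:
        test_path = Path(tmp) / "test_generated.py"
        test_path.write_text(generated_tests)
        try:
            result = subprocess.run(
                ["pytest", str(test_path), "-q", "--tb=no"],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Hung test runs count as total failure for this scenario.
            return 0.0
        # pytest prints a summary line such as "3 passed, 1 failed in 0.12s".
        passed = sum(int(n) for n in re.findall(r"(\d+) passed", result.stdout))
        failed = sum(int(n) for n in re.findall(r"(\d+) failed", result.stdout))
        total = passed + failed
        return passed / total if total else 0.0
```

A harness would run checks like these over many examples per scenario (generate, doc, test, fix, workspace) and aggregate pass rates per model; that aggregation is what allows the three LLMs evaluated in the paper to be compared on equal footing within a given IDE and parameter space.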