

Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming

February 22, 2024
Authors: Anisha Agarwal, Aaron Chan, Shubham Chandel, Jinu Jang, Shaun Miller, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Neel Sundaresan, Michele Tufano
cs.AI

Abstract

The integration of Large Language Models (LLMs) into Integrated Development Environments (IDEs) has become a focal point in modern software development. LLMs such as OpenAI GPT-3.5/4 and Code Llama offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants. However, using LLMs out of the box is unlikely to be optimal for any given scenario. Rather, each system requires the LLM to be honed to its set of heuristics to ensure the best performance. In this paper, we introduce the Copilot evaluation harness: a set of data and tools for evaluating LLM-guided IDE interactions, covering various programming scenarios and languages. We propose our metrics as a more robust and information-dense evaluation than previous state-of-the-art evaluation systems. We design and compute both static and execution-based success metrics for scenarios encompassing a wide range of developer tasks, including code generation from natural language (generate), documentation generation from code (doc), test case generation (test), bug-fixing (fix), and workspace understanding and query resolution (workspace). These success metrics are designed to evaluate the performance of LLMs within a given IDE and its respective parameter space. Our learnings from evaluating three common LLMs using these metrics can inform the development and validation of future scenarios in LLM-guided IDEs.
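To make the distinction between static and execution-based success metrics concrete, the sketch below shows one plausible shape such a check could take: a static gate that verifies a model-generated candidate parses, followed by an execution gate that runs the target project's test suite and treats a clean exit as success. This is a minimal illustration only, not the harness described in the paper; the names `evaluate_candidate`, `repo_dir`, and the use of pytest as the test runner are assumptions for the example.

```python
# Illustrative sketch of a static + execution-based success check.
# Assumptions: the candidate code is Python, it has already been applied
# inside repo_dir, and the project uses pytest as its test runner.
import ast
import subprocess
from dataclasses import dataclass


@dataclass
class EvalResult:
    syntax_ok: bool     # static metric: does the candidate parse?
    tests_passed: bool  # execution metric: does the test suite pass?


def static_syntax_check(candidate_code: str) -> bool:
    """Static metric: the candidate must at least be parseable."""
    try:
        ast.parse(candidate_code)
        return True
    except SyntaxError:
        return False


def execution_check(repo_dir: str, timeout_s: int = 300) -> bool:
    """Execution metric: run the project's tests and treat exit code 0 as success."""
    try:
        proc = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=repo_dir,
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def evaluate_candidate(candidate_code: str, repo_dir: str) -> EvalResult:
    syntax_ok = static_syntax_check(candidate_code)
    # Only pay the cost of executing tests if the cheap static gate passes.
    tests_passed = syntax_ok and execution_check(repo_dir)
    return EvalResult(syntax_ok=syntax_ok, tests_passed=tests_passed)
```

In a harness like the one described, per-scenario results of this kind would then be aggregated across tasks and languages to compare models within a given IDE configuration.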