Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

Abstract

Generative AI and large language models hold great promise for enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models in different scenarios relevant to programming education; however, these works are limited because they typically consider already outdated models or only specific scenarios. Consequently, there is no systematic study that benchmarks state-of-the-art models across a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with that of human tutors across a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and we assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance in several scenarios. These results also highlight settings where GPT-4 still struggles, suggesting future directions for developing techniques to improve the performance of these models.
