Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors
June 29, 2023
Authors: Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares
cs.AI
Abstract
Generative AI and large language models hold great promise in enhancing
computing education by powering next-generation educational technologies for
introductory programming. Recent works have studied these models for different
scenarios relevant to programming education; however, these works are limited
for several reasons, as they typically consider already outdated models or only
specific scenario(s). Consequently, there is a lack of a systematic study that
benchmarks state-of-the-art models for a comprehensive set of programming
education scenarios. In our work, we systematically evaluate two models,
ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human
tutors for a variety of scenarios. We evaluate using five introductory Python
programming problems and real-world buggy programs from an online platform, and
assess performance using expert-based annotations. Our results show that GPT-4
drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human
tutors' performance for several scenarios. These results also highlight
settings where GPT-4 still struggles, providing exciting future directions on
developing techniques to improve the performance of these models.