
Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

June 29, 2023
作者: Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, Gustavo Soares
cs.AI

Abstract

Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies for introductory programming. Recent works have studied these models for different scenarios relevant to programming education; however, these works are limited because they typically consider already-outdated models or only specific scenarios. Consequently, there is a lack of a systematic study that benchmarks state-of-the-art models across a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors across a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance for several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions for developing techniques to improve the performance of these models.
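To make the evaluation setup concrete, the following is a minimal sketch of how one scenario of this kind, repairing a buggy introductory Python program, might be scored automatically. The buggy program, the test suite, and the preference for small edits are illustrative assumptions, not the paper's actual benchmark problems or expert-annotation rubric.

```python
import difflib

# Illustrative buggy program (hypothetical, not from the paper's dataset).
BUGGY = """\
def is_even(n):
    return n % 2 == 1  # bug: tests for odd instead of even
"""

# A candidate repair, e.g. one proposed by a model or a human tutor.
FIXED = """\
def is_even(n):
    return n % 2 == 0
"""


def passes_tests(source: str) -> bool:
    """Execute a candidate program and run a small test suite against it."""
    namespace: dict = {}
    exec(source, namespace)
    is_even = namespace["is_even"]
    return is_even(4) is True and is_even(7) is False


def edit_size(original: str, repaired: str) -> int:
    """Count lines changed between the original and repaired programs.
    Smaller repairs are preferable when grading a fix."""
    diff = difflib.ndiff(original.splitlines(), repaired.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))
```

With these helpers, a harness would reject the buggy program (`passes_tests(BUGGY)` is false), accept the repair (`passes_tests(FIXED)` is true), and record that the fix touched two lines, approximating the kind of correctness-plus-minimality judgment that the paper's expert annotators make by hand.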