Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

Abstract

Generative AI and large language models hold great promise for enhancing computing education by powering next-generation educational technologies for introductory programming. Recent work has studied these models in different scenarios relevant to programming education; however, this work is limited for several reasons, as it typically considers already outdated models or only specific scenarios. Consequently, there is no systematic study that benchmarks state-of-the-art models on a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with that of human tutors across a variety of scenarios. We evaluate using five introductory Python programming problems and real-world buggy programs from an online platform, and assess performance using expert-based annotations. Our results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors' performance in several scenarios. These results also highlight settings where GPT-4 still struggles, providing exciting future directions for developing techniques to improve the performance of these models.
