Benchmarking AI Models in Software Engineering: A Review, Search Tool, and Enhancement Protocol

Abstract

Benchmarks are essential for consistent evaluation and reproducibility. The integration of Artificial Intelligence into Software Engineering (AI4SE) has given rise to numerous benchmarks for tasks such as code generation and bug fixing. However, this surge presents four challenges: (1) benchmark knowledge is scattered across tasks, (2) selecting relevant benchmarks is difficult, (3) there is no uniform standard for benchmark development, and (4) existing benchmarks have limitations. In this paper, we review 173 studies and identify 204 AI4SE benchmarks. We classify these benchmarks, analyze their limitations, and expose gaps in current practice. Based on our review, we created BenchScout, a semantic search tool for finding relevant benchmarks that automatically clusters the contexts of the associated studies. We conducted a user study with 22 participants to evaluate BenchScout's usability, effectiveness, and intuitiveness, which received average scores of 4.5, 4.0, and 4.1 out of 5, respectively. To advance benchmarking standards, we propose BenchFrame, a unified method for enhancing benchmark quality. As a case study, we applied BenchFrame to the HumanEval benchmark and addressed its main limitations. This led to HumanEvalNext, which features (1) corrected errors, (2) improved language conversion, (3) expanded test coverage, and (4) increased difficulty. We then evaluated ten state-of-the-art code language models on HumanEval, HumanEvalPlus, and HumanEvalNext. On HumanEvalNext, the models' pass@1 scores dropped by 31.22% relative to HumanEval and by 19.94% relative to HumanEvalPlus.
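The pass@1 scores reported in the abstract belong to the pass@k family of metrics. As a minimal sketch, and not the evaluation code used in this paper, the unbiased pass@k estimator commonly used with HumanEval-style benchmarks could be computed as below. The function names `pass_at_k` and `mean_pass_at_k` and the (n, c) input layout are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given hypothetical (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of generated samples that pass every test, so a stricter test suite lowers the score by rejecting samples that a weaker suite would accept.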
