Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'

Abstract

Large language models (LLMs) have shown remarkable code-generation ability, scoring above 90 pass@1 on Python coding problems in HumanEval and MBPP. Such high accuracy raises the question: can LLMs replace human programmers? Existing benchmarks, which are manually crafted, simple, or limited to single-line generation, cannot answer this question because they are too far removed from real-world software development. To answer it, we propose REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) among existing benchmarks. In our evaluation of ten LLMs, no model achieves more than 30 pass@1 on REPOCOD, revealing the need for stronger LLMs that can help developers in real-world software development.
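For readers unfamiliar with the pass@1 figures quoted above, the sketch below shows one common way such a score is computed: the unbiased pass@k estimator introduced alongside HumanEval, averaged over problems and reported on a 0-100 scale. The abstract does not state how REPOCOD computes its scores, so this is an assumption for illustration only; the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and not taken from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average per-problem pass@k over (n, c) pairs, on a 0-100 scale."""
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical data: four problems, one sample each, one solved.
    print(benchmark_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 0)], k=1))
```

With one sample per problem, pass@1 reduces to the percentage of problems whose single generated solution passes all tests, which is how a statement such as "no model exceeds 30 pass@1" can be read.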
