SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks such as code understanding and code generation. However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors, predicting the output and behavior of a program without actually running it. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark covering eight key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. We evaluate multiple open-source and proprietary LLMs on SURGE and conduct a scaling study to analyze the impact of model size and training data scale on surrogate execution accuracy. Additionally, we categorize model prediction errors and explore potential areas for improvement. Our findings indicate that while LLMs can predict code execution results in certain cases, they exhibit limitations in general-purpose surrogate execution. This study provides empirical insights into the feasibility of using LLMs as surrogate code executors. The code and dataset are released at https://github.com/Imbernoulli/SURGE.

