Executing as You Generate: Hiding Execution Latency in LLM Code Generation

Abstract

Current LLM-based coding agents follow a serial execution paradigm: the model first generates the complete code, then invokes an interpreter to execute it. This sequential workflow leaves the executor idle during generation and the generator idle during execution, resulting in unnecessary end-to-end latency. We observe that, unlike human developers, LLMs produce code tokens sequentially without revision, making it possible to execute code as it is being generated. We formalize this parallel execution paradigm, modeling it as a three-stage pipeline of generation, detection, and execution, and derive closed-form latency bounds that characterize its speedup potential and operating regimes. We then present Eager, a concrete implementation featuring AST-based chunking, dynamic batching with gated execution, and early error interruption. We evaluate Eager across four benchmarks, seven LLMs, and three execution environments. Results show that Eager reduces the non-overlapped execution latency by up to 99.9% and the end-to-end latency by up to 55% across seven LLMs and four benchmarks.
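
The core idea above, running code while it is still being generated, can be illustrated with a minimal Python sketch. This is not Eager's actual implementation: the function name `execute_while_generating`, the helper names, and the chunking heuristic are assumptions made for illustration. The sketch buffers streamed tokens into complete lines, uses `ast.parse` to check that the buffered text forms complete top-level statements, executes that chunk as soon as the next top-level statement begins, and stops at the first runtime error. It leaves out dynamic batching, gated execution, and sandboxing.

```python
import ast
import re

# Clause keywords that continue a compound statement at column 0.
_CLAUSE = re.compile(r"(else|elif|except|finally)\b")


def _starts_new_statement(line):
    """True if a complete line begins a fresh top-level statement."""
    # Blank, indented, and comment lines never mark a statement boundary.
    if not line.strip() or line[0] in " \t#":
        return False
    return _CLAUSE.match(line) is None


def _is_complete(source):
    """True if the buffered text parses as whole statements."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def execute_while_generating(token_stream, namespace=None):
    """Execute top-level statements as soon as they are known to be complete.

    Hypothetical helper for illustration only.
    Returns (namespace, executed_chunks, error); error is None on success.
    """
    namespace = {} if namespace is None else namespace
    executed = []
    pending = ""  # complete lines that have not been executed yet
    partial = ""  # trailing text that has no newline yet

    def flush():
        nonlocal pending
        chunk, pending = pending, ""
        if chunk.strip():
            exec(compile(chunk, "<stream>", "exec"), namespace)
            executed.append(chunk)

    try:
        for token in token_stream:
            partial += token
            while "\n" in partial:
                line, partial = partial.split("\n", 1)
                line += "\n"
                # A new top-level statement closes everything buffered so far,
                # provided the buffered text parses on its own.
                if _starts_new_statement(line) and _is_complete(pending):
                    flush()
                pending += line
        pending += partial
        flush()  # end of generation: run whatever is left
    except Exception as exc:  # early error interruption
        return namespace, executed, exc
    return namespace, executed, None
```

A caller would pass any iterable of text fragments, for example the token stream of a model API, and inspect the returned error to decide whether to cancel the rest of the generation. A chunk is held back until the following top-level statement appears because a compound statement such as `if` may still gain an `else` clause or further body lines; that is the price of never executing a partial statement.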
