Verbal Process Supervision Elicits Better Coding Agents

Abstract

The emergence of large language models and their application as AI agents have significantly advanced the state of the art on code generation benchmarks and transformed modern software engineering tasks. However, even reasoning models that use test-time compute still struggle with complex software engineering challenges. This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS), which achieves a 3.65% improvement over baseline models on challenging benchmarks such as BigCodeBench. Furthermore, when paired with the o3-mini model and VPS techniques, CURA attains state-of-the-art performance. This work represents a step forward in integrating reasoning-driven architectures with LLM-based code generation, enabling agentic reasoning for language models to solve complex software engineering tasks.
