Verbal Process Supervision Elicits Better Coding Agents

Abstract

The emergence of large language models and their application as AI agents has significantly advanced the state of the art on code generation benchmarks, transforming modern software engineering tasks. However, even reasoning models that use test-time compute still struggle with complex software engineering challenges. This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS), which achieves a 3.65% improvement over baseline models on challenging benchmarks such as BigCodeBench. Furthermore, CURA, when paired with the o3-mini model and VPS techniques, attains state-of-the-art performance. This work represents a step forward in integrating reasoning-driven architectures with LLM-based code generation, enabling language models to solve complex software engineering tasks through agentic reasoning.
