Verbal Process Supervision Elicits Better Coding Agents

Abstract

The emergence of large language models and their application as AI agents has significantly advanced the state of the art on code generation benchmarks and transformed modern software engineering tasks. However, even with reasoning models that use test-time compute, these systems still struggle with complex software engineering challenges. This work introduces CURA, a code understanding and reasoning agent system enhanced with verbal process supervision (VPS), which achieves a 3.65% improvement over baseline models on challenging benchmarks such as BigCodeBench. Furthermore, when paired with the o3-mini model and VPS techniques, CURA attains state-of-the-art performance. This work represents a step forward in integrating reasoning-driven architectures with LLM-based code generation, enabling language models to solve complex software engineering tasks through agentic reasoning.
