MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

January 30, 2026
Authors: Chuanzhe Guo, Jingjing Wu, Sijun He, Yang Chen, Zhaoqi Kuang, Shilong Fan, Bingjin Chen, Siqi Bao, Jing Liu, Hua Wu, Qingfu Zhu, Wanxiang Che, Haifeng Wang
cs.AI

Abstract

The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a bottleneck stemming from the complexity of constructing executable environments across diverse languages. To address this, we introduce MEnvAgent, a Multi-language framework for automated Environment construction that facilitates scalable generation of verifiable task instances. MEnvAgent employs a multi-agent Planning-Execution-Verification architecture to autonomously resolve construction failures and integrates a novel Environment Reuse Mechanism that reduces computational overhead by incrementally patching historical environments. Evaluations on MEnvBench, a new benchmark comprising 1,000 tasks across 10 languages, demonstrate that MEnvAgent outperforms baselines, improving Fail-to-Pass (F2P) rates by 8.6% while reducing time costs by 43%. Additionally, we demonstrate the utility of MEnvAgent by constructing MEnvData-SWE, the largest open-source polyglot dataset of realistic verifiable Docker environments to date, alongside solution trajectories that enable consistent performance gains on SWE tasks across a wide range of models. Our code, benchmark, and dataset are available at https://github.com/ernie-research/MEnvAgent.
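
The abstract reports Fail-to-Pass (F2P) rates without spelling out the check. As a minimal sketch, assuming the common SWE-bench-style reading (a test counts as Fail-to-Pass when it fails in the built environment before the reference fix and passes after it), the check could look like the following. The function names and the dictionary-based result format are hypothetical and are not taken from the MEnvAgent code.

```python
from typing import Dict, Set


def fail_to_pass(before: Dict[str, bool], after: Dict[str, bool]) -> Set[str]:
    """Return the tests that fail before the fix and pass after it.

    Each dict maps a test name to True (passed) or False (failed).
    Assumption: a test with no recorded result before the fix is not
    counted, so only an explicit failure qualifies.
    """
    return {
        name
        for name, passed in after.items()
        if passed and before.get(name) is False
    }


def is_verifiable(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Treat a task instance as verifiable if at least one test flips."""
    return len(fail_to_pass(before, after)) > 0


# Hypothetical test results for one task instance.
before_fix = {"test_parse": False, "test_io": True, "test_new": False}
after_fix = {"test_parse": True, "test_io": True, "test_new": False}

print(sorted(fail_to_pass(before_fix, after_fix)))
print(is_verifiable(before_fix, after_fix))
```

Under this reading, an environment that cannot run the tests at all yields no failing-then-passing test, so a higher F2P rate indicates that more constructed environments are usable for verification.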
February 6, 2026