

REVERE: Reflective Evolving Research Engineer for Scientific Workflows

March 21, 2026
Authors: Balaji Dinesh Gangireddi, Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
cs.AI

Abstract

Existing prompt-optimization techniques rely on local signals to update behavior, often neglecting broader and recurring patterns across tasks, leading to poor generalization; they further rely on full-prompt rewrites or unstructured merges, resulting in knowledge loss. These limitations are magnified in research-coding workflows, which involve heterogeneous repositories, underspecified environments, and weak feedback, where reproducing results from public codebases is an established evaluation regime. We introduce Reflective Evolving Research Engineer (REVERE), a framework that continuously learns from Global Training Context, recognizes recurring failure modes in cross-repository execution trajectories, distills them into reusable heuristics, and performs targeted edits across three configurable fields: the system prompt, a task-prompt template, and a cumulative cheatsheet. REVERE, via this reflective optimization framework, improves performance over prior state-of-the-art expert-crafted instructions on research coding tasks by 4.50% on SUPER, 3.51% on ResearchCodeBench, and 4.89% on ScienceAgentBench across their respective metrics. These results demonstrate that agents equipped with mechanisms for continual learning and global memory consolidation can meaningfully evolve their capabilities over time.
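The abstract names three configurable fields (system prompt, task-prompt template, cumulative cheatsheet) and a reflection step that distills recurring failure modes into reusable heuristics, applied as targeted edits rather than full-prompt rewrites. A minimal sketch of that loop is below; it is an illustration only, not the authors' implementation — the field names, the `distill_heuristics` stand-in (which would be an LLM reflection call in practice), and the "seen more than once" recurrence threshold are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    # The three configurable fields the abstract describes.
    system_prompt: str
    task_prompt_template: str
    cheatsheet: list[str] = field(default_factory=list)


def distill_heuristics(trajectories: list[dict]) -> list[str]:
    """Toy stand-in for the reflection step: find failure modes that
    recur across execution trajectories and phrase each as a rule."""
    counts = Counter(
        t["failure_mode"] for t in trajectories if t.get("failure_mode")
    )
    # Keep only patterns seen more than once ("recurring" across repos).
    return [
        f"If you hit '{mode}', check the repository's setup docs first."
        for mode, n in counts.items()
        if n > 1
    ]


def reflective_update(config: AgentConfig, trajectories: list[dict]) -> AgentConfig:
    """Targeted edit: append only new heuristics to the cumulative
    cheatsheet, so earlier knowledge is never overwritten or lost."""
    new_rules = [
        r for r in distill_heuristics(trajectories) if r not in config.cheatsheet
    ]
    config.cheatsheet.extend(new_rules)
    return config
```

The key design point mirrored here is that the cheatsheet only grows by deduplicated appends, contrasting with the full-prompt rewrites and unstructured merges the abstract identifies as a source of knowledge loss.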