REVERE: Reflective Evolving Research Engineer for Scientific Workflows
March 21, 2026
Authors: Balaji Dinesh Gangireddi, Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
cs.AI
Abstract
Existing prompt-optimization techniques rely on local signals to update behavior, often neglecting broader, recurring patterns across tasks, which leads to poor generalization; they further rely on full-prompt rewrites or unstructured merges, resulting in knowledge loss. These limitations are magnified in research-coding workflows, which involve heterogeneous repositories, underspecified environments, and weak feedback, and where reproducing results from public codebases is an established evaluation regime. We introduce Reflective Evolving Research Engineer (REVERE), a framework that continuously learns from a Global Training Context, recognizes recurring failure modes in cross-repository execution trajectories, distills them into reusable heuristics, and performs targeted edits across three configurable fields: the system prompt, a task-prompt template, and a cumulative cheatsheet. Via this reflective optimization framework, REVERE improves performance over prior state-of-the-art expert-crafted instructions on research-coding tasks by 4.50% on SUPER, 3.51% on ResearchCodeBench, and 4.89% on ScienceAgentBench, on their respective metrics. These results demonstrate that agents equipped with mechanisms for continual learning and global memory consolidation can meaningfully evolve their capabilities over time.