Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

June 4, 2026
Authors: Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia
cs.AI

Abstract

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, and such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains: software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis shows that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.
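The abstract names three steps (coreset selection, self-consistency over rollouts, pairwise self-preference) but gives no implementation details. The following is a minimal Python sketch of how such steps could be wired together, not the authors' method. Every name here (`select_coreset`, `self_consistency`, `pick_by_pairwise_preference`, the `difficulty` and `category` signals, and the `prefers` judge) is a hypothetical stand-in; in a real system `prefers` would presumably be an LLM call comparing two candidate harness updates.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def select_coreset(difficulty: Dict[str, float],
                   category: Dict[str, str],
                   k: int) -> List[str]:
    """Pick up to k hard tasks, taking the hardest task of each category first.

    Assumption: 'difficulty' and 'category' are hypothetical per-task signals
    derived from past trajectories; the paper's actual criteria may differ.
    """
    # Hardest first; task id breaks ties so the order is deterministic.
    ranked = sorted(difficulty, key=lambda t: (-difficulty[t], t))
    chosen: List[str] = []
    seen_categories = set()
    # First pass: at most one task per category, for diversity.
    for task in ranked:
        if len(chosen) < k and category[task] not in seen_categories:
            chosen.append(task)
            seen_categories.add(category[task])
    # Second pass: fill any remaining slots with the hardest leftover tasks.
    for task in ranked:
        if len(chosen) < k and task not in chosen:
            chosen.append(task)
    return chosen


def self_consistency(answers: Sequence[str]) -> float:
    """Fraction of rollouts that agree with the most common answer."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)


def pick_by_pairwise_preference(candidates: Sequence[str],
                                prefers: Callable[[str, str], bool]) -> str:
    """Round-robin tournament: return the candidate with the most pairwise wins.

    prefers(a, b) returns True when the judge prefers a over b.
    Assumes 'candidates' is non-empty and contains no duplicates.
    Ties are broken in favor of the earlier position in 'candidates'.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if prefers(a, b) else b
        wins[winner] += 1
    return max(candidates, key=lambda c: wins[c])
```

A round-robin comparison is only one way to aggregate pairwise preferences; it costs a quadratic number of judge calls in the number of candidates, which is acceptable when only a handful of harness updates are proposed per round.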