
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

June 10, 2026
Authors: Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao
cs.AI

Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
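
To make the idea of repairing a workflow graph from execution feedback more concrete, the following is a minimal toy sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the three-tool `CATALOG`, the `Step` record, the error kinds and their weights, and the `diagnose`, `apply_edit`, and `evolve` helpers are all hypothetical. A static checker stands in for real MCP execution, and only structured edits, feedback-driven selection, and a crude duplicate-based pruning are modeled; adaptive intensity and meta-guided redesign are left out.

```python
import difflib
import random
from dataclasses import dataclass, replace

# Hypothetical toy catalog (an assumption): tool name -> required parameter names.
CATALOG = {
    "search_papers": {"query"},
    "fetch_pdf": {"paper_id"},
    "summarize": {"text"},
}

# Assumed penalty weights. An unresolved tool is weighted above the number of
# parameters any catalog tool requires, so every repair edit lowers the score.
WEIGHTS = {"unresolved_tool": 3, "missing_param": 1, "bad_dependency": 1}


@dataclass(frozen=True)
class Step:
    tool: str
    params: tuple = ()  # names of the parameters supplied
    deps: tuple = ()    # indices of earlier steps this step consumes


def diagnose(workflow):
    """Stand-in for execution feedback: list (step_index, kind, detail) errors."""
    errors = []
    for i, step in enumerate(workflow):
        if step.tool not in CATALOG:
            errors.append((i, "unresolved_tool", step.tool))
        else:
            for name in sorted(CATALOG[step.tool] - set(step.params)):
                errors.append((i, "missing_param", name))
        for dep in step.deps:
            if not 0 <= dep < i:  # a dependency must point at an earlier step
                errors.append((i, "bad_dependency", dep))
    return errors


def score(workflow):
    """Weighted error count; 0 means the toy checker finds the plan feasible."""
    return sum(WEIGHTS[kind] for _, kind, _ in diagnose(workflow))


def apply_edit(workflow, error):
    """Apply one structured edit that targets a single reported error."""
    i, kind, detail = error
    step = workflow[i]
    if kind == "unresolved_tool":
        closest = difflib.get_close_matches(detail, sorted(CATALOG), n=1, cutoff=0.0)[0]
        step = replace(step, tool=closest)
    elif kind == "missing_param":
        step = replace(step, params=step.params + (detail,))
    else:  # bad_dependency: drop the invalid edge
        step = replace(step, deps=tuple(d for d in step.deps if d != detail))
    return workflow[:i] + (step,) + workflow[i + 1:]


def evolve(seed_workflow, generations=20, population_size=4, seed=0):
    """Mutate each candidate on a randomly chosen error, then keep the fittest."""
    rng = random.Random(seed)
    population = [tuple(seed_workflow)]
    for _ in range(generations):
        if score(population[0]) == 0:
            break
        children = []
        for parent in population:
            errors = diagnose(parent)
            if errors:
                children.append(apply_edit(parent, rng.choice(errors)))
        pool = list(dict.fromkeys(population + children))  # drop exact duplicates
        pool.sort(key=score)
        population = pool[:population_size]
    return population[0]


# A plausible-looking but infeasible plan: two misspelled tools, one missing
# parameter, and one dependency on a step that does not exist.
broken = (
    Step("search_paper", ("query",)),
    Step("fetch_pdf", (), (0,)),
    Step("summarise", ("text",), (1, 5)),
)
repaired = evolve(broken)
print([step.tool for step in repaired], "remaining errors:", len(diagnose(repaired)))
```

In this toy setting every edit strictly lowers the score, so the search always converges; with real tool execution an edit can expose new failures, which is presumably where population-level search and diversity become useful.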