Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy their schemas, preserve dependencies across intermediate outputs, and ground its final response in executed evidence. Small planners often generate plausible workflow graphs that then fail at tool resolution, parameter validation, dependency tracking, or execution. We argue that small-corpus distillation handles this failure mode poorly: a few hundred teacher traces can teach the workflow format, but they rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO trained on the same search-mined data match zero-shot performance, underperform it, or collapse; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
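
To make the repair loop concrete, the following is a minimal toy sketch of execution-guided evolutionary repair over a workflow. It is not the Evoflux implementation, whose details the abstract does not give. Everything here is an assumption made for illustration: the three-tool CATALOG, the Step type, the dry_run checker standing in for real execution feedback, the "<unbound>" placeholder value, and the evolve parameters. Meta-guided redesign is omitted, and adaptive intensity is reduced to "apply more edits when more errors are reported".

```python
import random
from dataclasses import dataclass, replace
from difflib import get_close_matches

# Hypothetical tool catalog: tool name -> required parameter names.
CATALOG = {
    "search_papers": {"query"},
    "fetch_pdf": {"paper_id"},
    "summarize": {"text"},
}


@dataclass(frozen=True)
class Step:
    tool: str
    params: tuple = ()  # (name, value) pairs
    deps: tuple = ()    # indices of earlier steps whose output is consumed


def dry_run(workflow):
    """Stand-in for execution feedback: list of (step, kind, detail) errors."""
    errors = []
    for i, step in enumerate(workflow):
        for d in step.deps:
            if not 0 <= d < i:
                errors.append((i, "bad_dep", d))
        if step.tool not in CATALOG:
            errors.append((i, "unknown_tool", step.tool))
            continue  # the schema is unknown until the tool resolves
        present = {name for name, _ in step.params}
        for name in sorted(CATALOG[step.tool] - present):
            errors.append((i, "missing_param", name))
    return errors


def apply_edit(workflow, error):
    """Structured edit: repair only the step and field named by one error."""
    i, kind, detail = error
    step = workflow[i]
    if kind == "unknown_tool":
        # Resolve to the closest catalog name (cutoff=0.0 always yields a match).
        best = get_close_matches(detail, list(CATALOG), n=1, cutoff=0.0)[0]
        step = replace(step, tool=best)
    elif kind == "missing_param":
        # Placeholder binding; a real planner would have to propose the value.
        step = replace(step, params=step.params + ((detail, "<unbound>"),))
    elif kind == "bad_dep":
        step = replace(step, deps=tuple(d for d in step.deps if d != detail))
    return workflow[:i] + (step,) + workflow[i + 1:]


def evolve(seed_wf, generations=8, pop_size=4, max_intensity=3, rng_seed=0):
    rng = random.Random(rng_seed)
    population = [seed_wf]
    for _ in range(generations):
        children = []
        for wf in population:
            errors = dry_run(wf)
            if not errors:
                return wf  # feasible workflow found
            # Adaptive intensity (simplified): more errors, more edits per child.
            child = wf
            for _ in range(min(max_intensity, len(errors))):
                remaining = dry_run(child)
                if not remaining:
                    break
                child = apply_edit(child, rng.choice(remaining))
            children.append(child)
        # Diversity pruning (simplified): drop exact duplicates, keep the best few.
        pool = list(dict.fromkeys(children + population))
        pool.sort(key=lambda w: len(dry_run(w)))
        population = pool[:pop_size]
    return min(population, key=lambda w: len(dry_run(w)))


# A plausible-looking plan with a misnamed tool, a missing parameter,
# and a dependency on a step that does not exist.
seed = (
    Step("search_paper", (("query", "MCP agents"),)),
    Step("fetch_pdf", (), (0,)),
    Step("summarize", (("text", "<step 1 output>"),), (1, 5)),
)
best = evolve(seed)
print(len(dry_run(seed)), "errors before,", len(dry_run(best)), "after")
```

In this sketch each error maps to one targeted edit, so the search mutates only the part of the plan that the checker reports as broken rather than regenerating the whole workflow. A real system would replace dry_run with calls against live tools and would need a planner to supply actual parameter values.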