Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy their schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that then fail at tool resolution, parameter validation, dependency tracking, or execution. We argue that small-corpus distillation handles this failure mode poorly. A few hundred teacher traces can teach the workflow format, but they rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO trained on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that, under scarce teacher-trace budgets, execution-grounded search is more reliable than small-corpus distillation.
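To make the repair framing concrete, the following is a minimal toy sketch of an evolutionary repair loop over a workflow, not the Evoflux implementation. Everything in it is an assumption introduced for illustration: the `CATALOG` contents, the `Step` type, the static `check` function (which stands in for real execution feedback from MCP servers), the three edit kinds, and the stall-based intensity rule. Meta-guided redesign is not modeled, and diversity pruning is reduced to dropping exact duplicates.

```python
import difflib
import random
from dataclasses import dataclass

# Hypothetical toy catalog: tool name -> required parameter names.
CATALOG = {
    "search_papers": {"query"},
    "fetch_pdf": {"paper_id"},
    "summarize": {"text"},
}

@dataclass(frozen=True)
class Step:
    tool: str
    params: tuple      # tuple of (name, value) pairs
    deps: tuple = ()   # indices of earlier steps whose output this step uses

def check(workflow):
    """Static stand-in for execution feedback: list of (index, kind, detail)."""
    errors = []
    for i, step in enumerate(workflow):
        if step.tool not in CATALOG:
            errors.append((i, "resolve", step.tool))
            continue
        missing = CATALOG[step.tool] - {name for name, _ in step.params}
        errors += [(i, "param", name) for name in sorted(missing)]
        errors += [(i, "dep", d) for d in step.deps if d < 0 or d >= i]
    return errors

def apply_edit(workflow, error):
    """Structured edit that targets exactly one reported failure."""
    i, kind, detail = error
    step = workflow[i]
    if kind == "resolve":    # snap an unknown tool name to the closest catalog entry
        name = difflib.get_close_matches(detail, list(CATALOG), n=1, cutoff=0.0)[0]
        step = Step(name, step.params, step.deps)
    elif kind == "param":    # add the missing parameter with a placeholder value
        step = Step(step.tool, tuple(sorted(step.params + ((detail, "<fill>"),))), step.deps)
    elif kind == "dep":      # drop a dependency that points at a non-earlier step
        step = Step(step.tool, step.params, tuple(d for d in step.deps if d != detail))
    return workflow[:i] + (step,) + workflow[i + 1:]

def evolve(seed, generations=10, pop_size=4, rng=None):
    rng = rng or random.Random(0)
    population, intensity, best = [seed], 1, len(check(seed))
    for _ in range(generations):
        children = []
        for wf in population:
            child = wf
            for _ in range(intensity):   # adaptive intensity: edits per child
                errs = check(child)
                if not errs:
                    break
                child = apply_edit(child, rng.choice(errs))
            children.append(child)
        # Crude diversity pruning: drop exact duplicates, keep the fittest.
        pool = list(dict.fromkeys(population + children))
        pool.sort(key=lambda wf: len(check(wf)))
        population = pool[:pop_size]
        score = len(check(population[0]))
        if score == 0:
            break
        # Apply more edits per child when the best score has stalled.
        intensity = 1 if score < best else min(intensity + 1, 4)
        best = min(best, score)
    return population[0]

# A plausible-looking plan with a misspelled tool, a missing parameter,
# a forward dependency, and a second unresolved tool name.
seed = (
    Step("serch_papers", (("query", "tool agents"),)),
    Step("fetch_pdf", (), deps=(2,)),
    Step("summarise", (("text", "placeholder text"),), deps=(1,)),
)
repaired = evolve(seed)
print([s.tool for s in repaired], check(repaired))
```

In this sketch each edit is driven by a specific failure report rather than by free-form regeneration, which is the sense in which the search is grounded in feedback. A real system would replace `check` with actual tool resolution, schema validation, and execution against live servers.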