FAMA: A Failure-Aware Meta-Agentic Framework for Open-Source Large Language Models in Interactive Tool-Use Environments

Abstract

Large language models (LLMs) are increasingly deployed as the decision-making core of autonomous agents that can effect change in external environments. Yet in conversational benchmarks that simulate real-world, customer-centric issue resolution, these agents frequently fail because of the cascading effects of incorrect decisions. These challenges are particularly pronounced for open-source LLMs with smaller parameter counts, limited context windows, and constrained inference budgets, all of which increase error accumulation in agentic settings. To address these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages. First, it analyzes failure trajectories from baseline agents to identify the most prevalent errors. Second, it uses an orchestration mechanism to activate a minimal subset of specialized agents tailored to those failures; these agents inject targeted context for the tool-use agent before the decision-making step. Experiments across open-source LLMs show performance gains of up to 27% over standard baselines across evaluation modes. These results indicate that targeted curation of context through specialized agents that address common failures is a valuable design principle for building reliable multi-turn tool-use LLM agents in settings that simulate real-world conversations.
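
The two stages described above can be illustrated with a minimal sketch. It is not the FAMA implementation: the failure labels, the `Trajectory` and `Specialist` types, the keyword-based triggers, and the function names `top_failures` and `orchestrate` are all hypothetical, chosen only to show how failure counts from baseline runs could decide which specialized agents contribute context before a decision step.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """One baseline run; failure_labels are hypothetical annotations."""
    task_id: str
    succeeded: bool
    failure_labels: List[str]


@dataclass
class Specialist:
    """A specialized agent, reduced here to two plain callables (an assumption)."""
    trigger: Callable[[str], bool]  # is this specialist relevant to the current state?
    advise: Callable[[str], str]    # context to inject for the tool-use agent


def top_failures(trajectories: List[Trajectory], k: int) -> List[str]:
    """Stage 1: count failure labels over failed runs and keep the k most common."""
    counts = Counter(
        label
        for t in trajectories
        if not t.succeeded
        for label in t.failure_labels
    )
    return [label for label, _ in counts.most_common(k)]


def orchestrate(state: str,
                specialists: Dict[str, Specialist],
                prevalent: List[str]) -> str:
    """Stage 2: activate only specialists tied to prevalent failures whose
    trigger fires on the current state, and join their hints into one context."""
    hints = [
        specialists[label].advise(state)
        for label in prevalent
        if label in specialists and specialists[label].trigger(state)
    ]
    return "\n".join(hints)


# Toy data, invented for illustration only.
runs = [
    Trajectory("t1", False, ["wrong_tool_argument", "policy_violation"]),
    Trajectory("t2", False, ["wrong_tool_argument"]),
    Trajectory("t3", True, []),
    Trajectory("t4", False, ["policy_violation", "premature_termination"]),
    Trajectory("t5", False, ["wrong_tool_argument"]),
]

specialists = {
    "wrong_tool_argument": Specialist(
        trigger=lambda s: "book" in s,
        advise=lambda s: "Check required arguments before calling a tool.",
    ),
    "policy_violation": Specialist(
        trigger=lambda s: "refund" in s,
        advise=lambda s: "Confirm the policy allows this action.",
    ),
}

prevalent = top_failures(runs, k=2)
context = orchestrate("user asks to book a flight", specialists, prevalent)
```

In this toy setup only the specialist whose trigger matches the current state contributes a hint, which mirrors the idea of activating a minimal subset of agents; a real system would presumably replace the keyword triggers and canned hints with LLM-backed components.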