Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces

March 5, 2026
Authors: Karan Gupta, Pranav Vajreshwari, Yash Pandya, Raghav Magazine, Akshay Nambi, Ahmed Awadallah
cs.AI

Abstract

Agentic systems operating over large tool ecosystems must plan and execute long-horizon workflows under weak or non-verifiable supervision. While frontier models mitigate these challenges through scale and large context budgets, small language models (SLMs) remain brittle: eager tool loading saturates context, execution errors compound over time, and sparse rewards limit learning. We introduce ATLAS, a reinforcement finetuning framework that enables SLMs to operate effectively in large-scale toolspace environments by learning how to acquire context and how to execute actions. Our approach makes two key contributions. First, we treat context control and execution structure as learnable decisions, combining iterative tool loading with programmatic tool orchestration to bound context growth and stabilize long-horizon trajectories. Second, we propose rubric-based reinforcement finetuning, which decomposes task success into structured, task-aligned criteria and enables scalable training using small judge models. Across MCP benchmarks, these design choices yield large and consistent gains over generic RL baselines, allowing a 4B SLM to approach frontier-agent performance under far tighter parameter and context budgets.
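
The two contributions lend themselves to short illustrations. First, a minimal sketch of iterative tool loading under a fixed context budget: tool schemas stay out of context in a registry, and a schema is loaded only on demand and only while the budget allows. The `ToolRegistry`/`AgentContext` names, the `search`/`try_load` methods, and the 4-characters-per-token estimate are hypothetical, not the paper's API:

```python
# Sketch only: illustrative names, not the ATLAS implementation.
from dataclasses import dataclass, field


@dataclass
class ToolRegistry:
    """Full tool schemas live here, outside the model's context."""
    schemas: dict[str, str]  # tool name -> schema text

    def search(self, query: str, k: int = 5) -> list[str]:
        # Toy substring match; a real registry might rank by embeddings.
        return [n for n in self.schemas if query.lower() in n.lower()][:k]

    def load(self, name: str) -> str:
        return self.schemas[name]


@dataclass
class AgentContext:
    """Tracks loaded schemas and enforces a context budget."""
    budget_tokens: int
    loaded: dict[str, str] = field(default_factory=dict)

    def used_tokens(self) -> int:
        # Rough estimate: ~4 characters per token.
        return sum(len(s) // 4 for s in self.loaded.values())

    def try_load(self, registry: ToolRegistry, name: str) -> bool:
        cost = len(registry.load(name)) // 4
        if self.used_tokens() + cost > self.budget_tokens:
            return False  # refusing the load keeps context growth bounded
        self.loaded[name] = registry.load(name)
        return True
```

Second, a sketch of how rubric-based reinforcement finetuning can turn task success into a denser training signal: the rubric is a weighted list of yes/no criteria, each scored by a small judge model, and the reward is the weighted fraction satisfied. The `judge` callable and the example criteria below are assumptions for illustration:

```python
# Sketch only: the judge interface and rubric items are invented examples.
from typing import Callable

Judge = Callable[[str, str], bool]  # (criterion, trajectory) -> passed?


def rubric_reward(trajectory: str,
                  rubric: list[tuple[str, float]],
                  judge: Judge) -> float:
    """Weighted fraction of rubric criteria the trajectory satisfies."""
    total = sum(w for _, w in rubric)
    earned = sum(w for crit, w in rubric if judge(crit, trajectory))
    return earned / total  # dense signal in [0, 1] instead of sparse 0/1


# Hypothetical rubric for a tool-use task:
rubric = [
    ("Were only the tools that were actually called loaded into context?", 1.0),
    ("Did every tool call use schema-valid arguments?", 1.0),
    ("Does the final answer satisfy the user's request?", 2.0),
]
```

Replacing a single end-of-trajectory 0/1 success signal with a per-criterion score of this kind is, per the abstract, what lets small judge models supervise long-horizon trajectories despite otherwise sparse rewards.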