Scaling Multiagent Systems with Process Rewards
January 30, 2026
Authors: Ed Li, Junyu Ren, Cat Yan
cs.AI
Abstract
While multiagent systems have shown promise for tackling complex tasks via specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. By assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground-truth labels while extracting maximal training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA achieves gains of +5.0--17.5pp on AIME and +7.8--17.2pp on AMC. For data analysis tasks, our method improves success rate by +12.5pp, while quality metrics improve by up to 30%, validating that per-action supervision can yield improvements across different multiagent systems in various domains. By addressing these challenges, our work takes a first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human supervision.
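To make the core idea concrete, here is a minimal sketch of per-action credit assignment. This is not the authors' implementation: `AgentAction`, `per_action_rewards`, and `stub_judge` are hypothetical names, and the stub judge stands in for the AI feedback model that MAPPA would use to score each action. The point is only the shape of the supervision signal: one reward per agent action, rather than a single reward at task completion.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentAction:
    agent: str    # which agent in the multiagent system acted
    content: str  # the action taken (a message, tool call, etc.)

def per_action_rewards(
    rollout: List[AgentAction],
    judge: Callable[[List[AgentAction], int], float],
) -> List[float]:
    """Score every action in a rollout individually, so each agent
    receives credit for its own contribution instead of sharing one
    end-of-task outcome reward."""
    return [judge(rollout, i) for i in range(len(rollout))]

# Stub judge for illustration only: a real judge would be an AI
# feedback model scoring the action in context. Here we simply
# reward longer, more substantive actions, capped at 1.0.
def stub_judge(rollout: List[AgentAction], i: int) -> float:
    return min(len(rollout[i].content) / 20.0, 1.0)

rollout = [
    AgentAction("planner", "Decompose the problem into two lemmas."),
    AgentAction("solver", "ok"),
    AgentAction("verifier", "Lemma 1 checks out; lemma 2 needs a fix."),
]
rewards = per_action_rewards(rollout, stub_judge)
print(rewards)  # one scalar reward per action in the rollout
```

Each (action, reward) pair can then serve as a finetuning example for the agent that took the action, which is how every rollout, even a failed one, yields training signal for all agents involved.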