Streaming Communication in Multi-Agent Reasoning

Abstract

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm, in which each agent finishes its entire reasoning chain before passing it on, so end-to-end latency scales linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thereby reducing latency. Surprisingly, this pipelining also improves effectiveness: the quality of multi-step reasoning is non-uniform, with early steps more reliable than later ones, so working from these reliable early steps rather than the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of the serial, stream, and single protocols, deriving the effectiveness ordering, the speedup upper bound, and the cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026 with Claude Opus 4.6-high). Beyond these contributions, we identify a "step-level scaling law": increasing the number of steps per agent consistently improves both effectiveness and efficiency, a new scaling dimension that is orthogonal to and composable with agent-count scaling.
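The latency argument can be illustrated with a toy schedule for a chain of agents. The sketch below is not the paper's closed-form analysis; it assumes that every step takes the same time, that transfer is free, and that an agent can produce its step s as soon as it has finished its own step s-1 and its upstream neighbor has emitted step s. The function names and the unit `step_time` are hypothetical.

```python
def serial_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return num_agents * steps_per_agent * step_time


def streaming_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Step-level streaming in a chain, simulated as a pipeline schedule.

    Assumption (illustrative only): agent a can finish step s once its own
    step s-1 is done and upstream agent a-1 has emitted its step s.
    """
    # finish[a][s] = time at which agent a finishes step s
    finish = [[0.0] * steps_per_agent for _ in range(num_agents)]
    for a in range(num_agents):
        for s in range(steps_per_agent):
            own_prev = finish[a][s - 1] if s > 0 else 0.0
            upstream = finish[a - 1][s] if a > 0 else 0.0
            finish[a][s] = max(own_prev, upstream) + step_time
    return finish[-1][-1]


def speedup(num_agents: int, steps_per_agent: int) -> float:
    """Ratio of serial to streaming latency under the toy model."""
    return serial_latency(num_agents, steps_per_agent) / streaming_latency(
        num_agents, steps_per_agent
    )


if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        print(f"agents=4 steps={k:>2} speedup={speedup(4, k):.2f}")
```

Under these assumptions the streaming latency works out to (N + K - 1) step times for N agents with K steps each, so the speedup NK / (N + K - 1) grows with K and stays below N. That behavior is consistent in spirit with the speedup upper bound and the step-level scaling trend described in the abstract, but the exact expressions in the paper may differ.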