Streaming Communication in Multi-Agent Reasoning

Abstract

Multi-agent reasoning systems typically adopt a "generate-then-transfer" paradigm, in which each agent finishes its full reasoning chain before passing it on, so end-to-end latency scales linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thereby reducing latency. Surprisingly, this pipelining also improves effectiveness: because the quality of multi-step reasoning is non-uniform and early steps are more reliable than later ones, working from these reliable early steps rather than the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of the stream, serial, and single protocols, deriving the effectiveness ordering, the speedup upper bound, and the cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (+7.3 pp on average; up to +22.4 pp, on HMMT 2026 with Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing the number of steps per agent consistently improves both effectiveness and efficiency, a new scaling dimension that is orthogonal to, and composable with, agent-count scaling.
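
The latency argument can be illustrated with a minimal sketch. This is a toy model, not the closed-form analysis the abstract refers to, and every name and assumption in it is hypothetical: agents form a simple chain, every reasoning step takes the same time, transfer cost is zero, and a downstream agent can begin as soon as its upstream neighbor has emitted its first step.

```python
# Toy latency model for a chain of agents (illustrative assumptions only:
# uniform step time, zero transfer cost, downstream starts one step behind).

def serial_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return num_agents * steps_per_agent * step_time


def streaming_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Idealized pipeline: agent i starts once agent i-1 has emitted one step,
    so the last agent finishes after (num_agents - 1) start-up delays plus
    its own steps_per_agent steps."""
    return (num_agents + steps_per_agent - 1) * step_time


def speedup(num_agents: int, steps_per_agent: int) -> float:
    """Ratio of serial to streaming latency under the toy model."""
    return serial_latency(num_agents, steps_per_agent) / streaming_latency(
        num_agents, steps_per_agent
    )


if __name__ == "__main__":
    for steps in (1, 8, 64):
        print(f"agents=4 steps={steps} speedup={speedup(4, steps):.2f}")
```

Under these assumptions the serial latency grows with the product of depth and steps, while the streaming latency grows with their sum, so the toy speedup is bounded above by the number of agents and approaches that bound as the number of steps per agent increases.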