Streaming Communication in Multi-Agent Reasoning

Abstract

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm, in which each agent finishes its full output before handing it on, so end-to-end latency scales linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thereby reducing latency. Surprisingly, this pipelining also improves effectiveness: the quality of multi-step reasoning is non-uniform, with early steps more reliable than later ones, so working from these reliable early steps rather than the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of the stream, serial, and single protocols, deriving the effectiveness ordering, the speedup upper bound, and the cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing the number of steps per agent consistently improves both effectiveness and efficiency, giving a new scaling dimension that is orthogonal to, and composable with, agent-count scaling.
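
To make the latency argument concrete, the following is a minimal toy model of a Chain topology, not the closed-form analysis referred to above. It rests on assumptions that are not taken from the paper: every agent emits the same number of steps, every step takes the same time, transfer is free, and a downstream agent can begin its step s as soon as it has finished its own step s-1 and the upstream agent has emitted step s. The function names `serial_latency` and `stream_latency` are hypothetical.

```python
from typing import List


def serial_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Generate-then-transfer: each agent waits for the complete upstream chain."""
    return num_agents * steps_per_agent * step_time


def stream_latency(num_agents: int, steps_per_agent: int, step_time: float = 1.0) -> float:
    """Step-level streaming on a chain (toy assumption, see lead-in).

    Agent a starts its step s once it has finished its own step s-1 and
    the upstream agent a-1 has emitted step s.
    """
    # Finish times of the upstream agent's steps; the virtual source is ready at t=0.
    upstream: List[float] = [0.0] * steps_per_agent
    for _ in range(num_agents):
        current: List[float] = []
        own_previous = 0.0  # time at which this agent finished its previous step
        for s in range(steps_per_agent):
            start = max(own_previous, upstream[s])
            own_previous = start + step_time
            current.append(own_previous)
        upstream = current
    return upstream[-1]  # finish time of the last agent's last step


if __name__ == "__main__":
    agents = 4
    for steps in (1, 4, 16):
        serial = serial_latency(agents, steps)
        stream = stream_latency(agents, steps)
        print(f"steps={steps:2d} serial={serial:5.1f} stream={stream:5.1f} "
              f"speedup={serial / stream:.2f}")
```

Under these assumptions the streamed chain finishes after (n + k - 1) step times instead of n * k, where n is the number of agents and k the steps per agent. The ratio n * k / (n + k - 1) grows with k and stays below n, which is how adding steps per agent can help efficiency in this simplified setting. The paper's own bounds may depend on quantities this sketch ignores.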