When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models
June 25, 2026
Author: Josef Chen
cs.AI
Abstract
Multi-model LLM systems such as routing, voting, cascades, fusion, and mixture-of-agents are used to beat single-model accuracy. We show that their gain is capped by a quantity the field rarely reports. For any policy whose output is one member model's answer, accuracy cannot exceed one minus beta, where beta is the rate at which every model is wrong on the same query. In contrast, the usual diagnostic, the average pairwise error correlation rho, cannot identify beta: error laws with identical marginals and pairwise correlations can have different all-wrong rates. A Clopper-Pearson bound on beta gives a finite-sample certificate, available before any router is trained, on the largest gain that any router, vote, or cascade could deliver.
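The ceiling and its finite-sample interval can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names, the toy correctness matrix, and the bisection-based Clopper-Pearson routine are all assumptions made for this example. It counts the queries on which every model fails, reports one minus that rate as the best accuracy any answer-selecting policy could reach on the sample, and computes an exact binomial interval for beta. On our reading, the lower end of that interval is the side that caps the achievable gain.

```python
import math


def count_all_wrong(correct):
    """correct[q][m] is True when model m answers query q correctly.

    Returns (number of queries on which every model is wrong, number of queries).
    """
    n_all_wrong = sum(1 for row in correct if not any(row))
    return n_all_wrong, len(correct)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(k, n, alpha=0.10):
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for the p where f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Hypothetical toy data: 5 queries by 3 models, with one query failed by all models.
correct = [
    [True, False, True],
    [False, False, False],
    [True, True, False],
    [False, True, True],
    [True, True, True],
]
k, n = count_all_wrong(correct)
lower, upper = clopper_pearson(k, n, alpha=0.10)
print(f"beta_hat = {k / n:.3f}, 90% interval = ({lower:.3f}, {upper:.3f})")
print(f"best accuracy any selection policy can reach on this sample: {1 - k / n:.3f}")
```

Nothing here depends on how a router is built: the inputs are only per-query correctness flags, which is why the bound can be computed before any training.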
Across 67 models from 21 providers, a tetrachoric-calibrated single-factor model still underprices the all-wrong tail: on open-ended mathematics, observed beta is 0.052 versus 0.023 under the full 67-model Gaussian copula, an underpricing of about 2.5 times (90% CI 1.7 to 3.4, k = 17). The effect recurs on execution-graded code, where beta is 0.079. Re-asking the same GPQA-Diamond questions in free-response rather than multiple-choice form reopens the tail, with beta of 0.127, and a five-judge panel (kappa 0.73 to 0.92) locates the co-failure in answer format rather than subject. At matched quality, low-rho heterogeneous ensembles beat high-rho Self-MoA, but on the checkable tasks in our pool, combining models rarely beats the single best model without a strong query-level routing signal. Gains come from models failing on different questions, not from adding more models.
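To make the copula comparison concrete, the sketch below computes the all-wrong rate implied by a one-factor Gaussian copula, given each model's marginal error rate and a single shared latent correlation. It is an illustration under stated assumptions rather than the paper's procedure: the function name, the trapezoid-rule integration, and the example error rates are hypothetical, and the tetrachoric calibration step is not shown (the correlation is simply passed in as `rho`, which must satisfy 0 <= rho < 1).

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def copula_all_wrong_rate(error_rates, rho, grid=4001, span=8.0):
    """All-wrong probability under a one-factor Gaussian copula.

    Model i errs when sqrt(rho) * F + sqrt(1 - rho) * e_i < Phi^{-1}(p_i), where
    F and the e_i are independent standard normals, so every pair of latent
    variables has correlation rho. Conditional on F the errors are independent,
    which leaves a one-dimensional integral over F (trapezoid rule here).
    Each error rate must lie strictly between 0 and 1.
    """
    thresholds = [STD_NORMAL.inv_cdf(p) for p in error_rates]
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    step = 2.0 * span / (grid - 1)
    total = 0.0
    for j in range(grid):
        f = -span + j * step
        weight = 0.5 if j in (0, grid - 1) else 1.0
        joint = 1.0
        for t in thresholds:
            joint *= STD_NORMAL.cdf((t - a * f) / b)
        total += weight * STD_NORMAL.pdf(f) * joint * step
    return total


# Hypothetical marginal error rates for five models.
rates = [0.30, 0.25, 0.20, 0.35, 0.30]
for rho in (0.0, 0.3, 0.6):
    implied = copula_all_wrong_rate(rates, rho)
    print(f"rho = {rho:.1f}: copula-implied all-wrong rate = {implied:.4f}")
```

Dividing an observed all-wrong rate by this implied value gives an underpricing ratio of the kind the abstract reports. A ratio above one means models fail together more often than pairwise correlation alone predicts.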