Fortytwo: Swarm Inference with Peer-Ranked Consensus
October 27, 2025
Authors: Vladyslav Larin, Ihor Naumenko, Aleksei Ivashov, Ivan Nikitin, Alexander Firsov
cs.AI
Abstract
As centralized AI hits compute ceilings and diminishing returns from
ever-larger training runs, meeting demand requires an inference layer that
scales horizontally in both capacity and capability. We present Fortytwo, a
novel protocol that leverages swarm intelligence principles and distributed
pairwise ranking consensus to achieve superior performance in AI inference. Our
approach reimagines collaboration among AI nodes using swarm inference: a
peer-ranked, reputation-weighted consensus across heterogeneous models that
surfaces the highest-quality responses. Using pairwise ranking with a custom
Bradley-Terry-style aggregation model, we demonstrate that swarm inference
substantially outperforms majority voting, achieving 85.90% on GPQA Diamond
versus 68.69% for majority voting with the same model set - an improvement of
+17.21 percentage points (approximately +25.1% relative). The protocol
incorporates on-chain reputation so node influence adapts to demonstrated
accuracy over time, yielding a meritocratic consensus that filters low-quality
or malicious participants. To resist Sybil attacks, Fortytwo employs
proof-of-capability in its consensus: nodes must successfully complete
calibration/test requests and stake reputation to enter ranking rounds, making
multi-identity attacks economically unattractive while preserving openness.
Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and
AIME, our evaluation indicates higher accuracy and strong resilience to
adversarial and noisy free-form prompting (e.g., prompt-injection degradation
of only 0.12% versus 6.20% for a monolithic single-model baseline), while
retaining practical deployability. Together, these results establish a
foundation for decentralized AI systems - democratizing access to high-quality
inference through collective intelligence without sacrificing reliability or
security.
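
To make the abstract's headline comparison concrete, here is a minimal Python sketch of Bradley-Terry aggregation over peer pairwise rankings, the general technique the abstract contrasts with majority voting. The MM update used here is the standard Bradley-Terry fitting procedure, not the paper's custom model; the function name, the candidate labels, and the comparison counts are illustrative assumptions, and reputation weighting is approximated by letting each comparison contribute a reputation-scaled count.

```python
# Minimal sketch (assumptions, not the paper's custom model): fit
# Bradley-Terry strengths to peer pairwise comparisons among candidate
# responses, then select the highest-scoring response as the consensus
# answer instead of taking a plurality vote.
from collections import defaultdict

def bradley_terry_scores(wins, candidates, iters=100, tol=1e-9):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[(i, j)] = (possibly reputation-weighted) count of pairwise
    comparisons in which candidate i was ranked above candidate j.
    """
    p = {c: 1.0 for c in candidates}  # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in candidates:
            # Total (weighted) wins of i over all rivals.
            num = sum(wins.get((i, j), 0) for j in candidates if j != i)
            # MM denominator: comparisons with j, scaled by 1/(p_i + p_j).
            den = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in candidates if j != i
            )
            new_p[i] = num / den if den > 0 else p[i]
        total = sum(new_p.values())  # normalize to keep scores comparable
        new_p = {c: v / total for c, v in new_p.items()}
        if max(abs(new_p[c] - p[c]) for c in candidates) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical usage: three candidate responses A, B, C. Each peer
# node's pairwise judgment adds its reputation weight to the count,
# so higher-reputation nodes exert more influence on the consensus.
wins = defaultdict(float)
judgments = [("A", "B", 1.0), ("A", "B", 0.9), ("B", "A", 0.4),
             ("A", "C", 1.0), ("C", "A", 0.6), ("B", "C", 0.8)]
for winner, loser, reputation in judgments:
    wins[(winner, loser)] += reputation

scores = bradley_terry_scores(wins, ["A", "B", "C"])
best = max(scores, key=scores.get)  # consensus response
print(best, scores)
```

A response can win under this scheme without holding a plurality of first-place votes, because pairwise rankings extract strictly more preference information from each node than a single vote, which is the intuition behind the gap over majority voting reported above.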