Fortytwo: Swarm Inference with Peer-Ranked Consensus

Abstract

As centralized AI hits compute ceilings and diminishing returns from ever-larger training runs, meeting demand requires an inference layer that scales horizontally in both capacity and capability. We present Fortytwo, a novel protocol that applies swarm intelligence principles and distributed pairwise ranking consensus to AI inference. Our approach recasts collaboration among AI nodes as swarm inference: a peer-ranked, reputation-weighted consensus across heterogeneous models that surfaces the highest-quality responses. Using pairwise ranking with a custom Bradley-Terry-style aggregation model, we demonstrate that swarm inference substantially outperforms majority voting, achieving 85.90% on GPQA Diamond versus 68.69% for majority voting with the same model set, an improvement of 17.21 percentage points (approximately 25.1% relative). The protocol incorporates on-chain reputation so that each node's influence adapts to its demonstrated accuracy over time, yielding a meritocratic consensus that filters out low-quality or malicious participants. To resist Sybil attacks, Fortytwo employs proof-of-capability in its consensus: nodes must successfully complete calibration/test requests and stake reputation to enter ranking rounds, which makes multi-identity attacks economically unattractive while preserving openness. Across six challenging benchmarks, including GPQA Diamond, LiveCodeBench, and AIME, our evaluation indicates higher accuracy than a monolithic single-model baseline and strong resilience to adversarial and noisy free-form prompting (prompt-injection degradation of only 0.12% versus 6.20% for the baseline), while retaining practical deployability. Together, these results establish a foundation for decentralized AI systems that democratize access to high-quality inference through collective intelligence without sacrificing reliability or security.
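To make the idea of reputation-weighted pairwise ranking concrete, the following is a minimal sketch, not the protocol's actual aggregation model, which the abstract describes only as "custom". It fits a standard Bradley-Terry model with the classic minorization-maximization update, and weights each pairwise vote by the judging node's reputation. The function name `bradley_terry`, the `prior` smoothing term, the node names, and the sample reputations are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def bradley_terry(comparisons, reputation, prior=0.1, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry strengths from reputation-weighted pairwise votes.

    comparisons: iterable of (judge, winner, loser) tuples.
    reputation:  dict mapping judge -> non-negative weight (assumed given).
    prior:       small virtual win added in both directions for every pair,
                 an assumption of this sketch that keeps all strengths positive.
    Returns a dict mapping candidate -> strength, normalized to sum to 1.
    """
    wins = defaultdict(float)  # wins[(a, b)] = weighted count of "a beat b"
    candidates = set()
    for judge, winner, loser in comparisons:
        wins[(winner, loser)] += reputation.get(judge, 0.0)
        candidates.update((winner, loser))
    if not candidates:
        return {}

    candidates = sorted(candidates)
    for a, b in combinations(candidates, 2):
        wins[(a, b)] += prior
        wins[(b, a)] += prior

    strength = {c: 1.0 / len(candidates) for c in candidates}
    for _ in range(n_iter):
        updated = {}
        for i in candidates:
            others = [j for j in candidates if j != i]
            total_wins = sum(wins[(i, j)] for j in others)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in others
            )
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        updated = {c: s / norm for c, s in updated.items()}
        delta = max(abs(updated[c] - strength[c]) for c in candidates)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical round: two high-reputation nodes prefer response A over B,
# three low-reputation nodes prefer B over A; every node ranks C last.
reputation = {"n1": 0.9, "n2": 0.8, "n3": 0.2, "n4": 0.2, "n5": 0.2}
votes = []
for judge in ("n1", "n2"):
    votes += [(judge, "A", "B"), (judge, "A", "C"), (judge, "B", "C")]
for judge in ("n3", "n4", "n5"):
    votes += [(judge, "B", "A"), (judge, "A", "C"), (judge, "B", "C")]

scores = bradley_terry(votes, reputation)
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy round an unweighted head count would favor B over A by three votes to two, while the reputation-weighted fit ranks A first. That is the qualitative behavior the abstract attributes to reputation-weighted consensus; the numbers here are made up for illustration and say nothing about the reported benchmark results.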