Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Abstract

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. This approach, Diverse Prompt Mixer, is tested in the AIMO 3 competition: 3 models, 23+ experiments, 50 IMO-level problems, one H100 80 GB GPU, and a 5-hour limit. Every prompt-level intervention fails: high-temperature sampling already decorrelates errors, and weaker strategies reduce accuracy more than they reduce correlation. Model capability dominates, both across an 8-point capability gap at equal N=8 and across every optimization tested. The gap between the best majority-vote score (42/50) and pass@20 (~45.5) is selection loss, not prompt loss. A verifier-based selector could close it; prompt engineering cannot.
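The distinction between a majority-vote score and a pass@k-style score can be illustrated with a minimal sketch. It assumes each problem comes with a list of sampled final answers and one known correct answer; the function names and the toy data are hypothetical and are not taken from the competition runs. "Any correct" is used here as a simple empirical stand-in for pass@k, and selection loss is the number of problems where a correct sample exists but the vote picks something else.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def score(problems):
    """Score a list of (sampled_answers, correct_answer) pairs.

    Returns (majority_correct, any_correct): how many problems the vote
    solves, and how many have at least one correct sample.
    """
    majority_correct = sum(majority_vote(a) == truth for a, truth in problems)
    any_correct = sum(truth in a for a, truth in problems)
    return majority_correct, any_correct


# Toy data, made up for illustration only.
problems = [
    ([12, 12, 7, 12], 12),  # the vote is correct
    ([5, 9, 9, 3], 5),      # a correct sample exists, but the vote picks 9
    ([4, 4, 4, 4], 8),      # no sample is correct
]

majority_correct, any_correct = score(problems)
selection_loss = any_correct - majority_correct
print(majority_correct, any_correct, selection_loss)
```

In this toy case the second problem is the only source of selection loss: a better selector could recover it, whereas no choice of selector helps the third problem, where no sample is correct.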

