Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3
April 16, 2026
Author: Natapong Nitarach
cs.AI
Abstract
Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. We test this approach, called Diverse Prompt Mixer, in the AIMO 3 competition: 3 models, 23+ experiments, 50 IMO-level problems, a single H100 80 GB GPU, and a 5-hour time limit. Every prompt-level intervention fails. High-temperature sampling already decorrelates errors, and weaker strategies hurt accuracy more than they reduce correlation. Across an 8-point capability gap at equal N=8, and under every optimization tested, model capability dominates. The gap between the best majority-vote score (42/50) and pass@20 (~45.5) is selection loss, not prompt loss. A verifier-based selector could close it; prompt engineering cannot.
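To make the selection-loss gap concrete, here is a minimal sketch (not the authors' code) of majority voting versus pass@N over one problem's sampled answers. The attempt values and the correct answer are hypothetical, chosen to show correlated errors steering the vote toward a wrong answer even though a correct answer exists among the samples.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer among N sampled attempts.

    Ties are broken by first occurrence, per Counter.most_common ordering.
    """
    return Counter(answers).most_common(1)[0][0]

def pass_at_n(answers, correct):
    """pass@N: True if any of the N attempts produced the correct answer."""
    return correct in answers

# Hypothetical attempts for one problem where the correct answer is 42.
# Correlated errors make 17 the plurality answer.
attempts = [17, 17, 17, 42, 42, 9, 17, 5]

print(majority_vote(attempts))   # votes select 17, a wrong answer
print(pass_at_n(attempts, 42))   # yet pass@8 is True: 42 was sampled
```

The difference between these two numbers over a problem set is exactly the selection loss the abstract describes; a verifier-based selector would replace the vote with a check that picks the correct sample when it is present.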