Robust Reasoning Benchmark
March 26, 2026
Authors: Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey
cs.AI
Abstract
While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (up to 55% average accuracy drops across perturbations, and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models' working memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters, as well as Claude Opus 4.6, exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model's own Chain-of-Thought, which raises fundamental open questions regarding the optimal granularity of atomic reasoning tasks.