Robust Reasoning Benchmark
March 26, 2026
Authors: Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey
cs.AI
Abstract
While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (average accuracy drops of up to 55% across perturbations, and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models' working-memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters, as well as Claude Opus 4.6, exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model's own Chain-of-Thought, raising fundamental open questions about the optimal granularity of atomic reasoning tasks.
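The abstract does not enumerate the 14 perturbation techniques, but the general idea of a formatting perturbation that preserves a problem's meaning for a human reader while altering the token stream can be sketched as follows. This is a hypothetical illustration (a homoglyph substitution), not the authors' actual pipeline; the function name and character map are assumptions.

```python
# Hypothetical sketch of ONE possible formatting perturbation.
# The paper's 14 techniques are not listed in the abstract, so this
# Latin-to-Cyrillic homoglyph swap is purely illustrative.

# Visually near-identical Cyrillic counterparts of common Latin letters.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
              "p": "\u0440", "c": "\u0441"}

def perturb_homoglyphs(problem: str) -> str:
    """Swap selected Latin letters for Cyrillic look-alikes.

    The rendered text looks unchanged to a human, but the model sees a
    different byte/token sequence, probing formatting robustness.
    """
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in problem)

original = "Compute the remainder when 2^10 is divided by 7."
perturbed = perturb_homoglyphs(original)
assert perturbed != original            # surface form changed
assert len(perturbed) == len(original)  # character count preserved
```

An evaluation harness in this style would score a model on both `original` and `perturbed` variants and report the accuracy gap per perturbation type.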