使用最小形式化证明对大型语言模型推理能力的压力测试

摘要

我们提出ProofGrid，这是一个通过机器可检验证明而非仅依据最终答案来评估大语言模型推理能力的基准测试套件。ProofGrid包含15项任务，涵盖证明书写、证明检查、证明掩码及证明补全。这些任务以极简形式化符号表述，尤其采用NDL——一种紧凑的自然演绎语言，可嵌入短提示中，并支持精确、可审计的验证。由此实现机械化、可复现且细粒度的评估，而非依赖人类或大语言模型判断。ProofGrid覆盖校准的难度梯度，从基础推理测试到结构复杂的挑战性任务（当前模型均无法解决），同时最大程度降低对领域知识、求解器代理及长上下文伪影的依赖。我们还开发了一套用于推理基准的对比框架，并以此从表示方式、验证保障及推理深度三个维度，将ProofGrid定位至现有相关工作之中。在方法论上，我们引入了一套带检测的证明验证流程，该流程能容忍细微的表面差异，同时定位首次实质性推理失败点，从而提升测量分辨率，并将证明规划与低层次执行噪声相分离。利用此流程，我们评估了广泛的开源与闭源模型。结果表明，模型进展迅速但仍有显著局限：前沿模型在若干基础任务上表现良好，但困难任务——尤其是需要全局组合推理或低层次证明综合的任务——仍远未解决。我们还识别出一种认知不稳定性现象：模型能生成有缺陷的证明，却能正确拒绝那些局部推理步骤；我们通过认知稳定性指数将其形式化。最后，我们以双参数项目反应理论分析、赖特图及基于Fisher信息的标准化任务区分度指标，对准确性评估进行了补充。

English

We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in terms of representation, verification guarantees, and reasoning depth. Methodologically, we introduce an instrumented proof-checking pipeline that tolerates minor surface deviations while locating the first substantive reasoning failure, improving measurement resolution and separating proof planning from low-level execution noise. Using this pipeline, we evaluate a broad range of open and proprietary models. Results show rapid progress but substantial remaining limits: frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved. We also identify epistemic instability, where models generate flawed proofs yet correctly reject those local inferences in isolation, and formalize this with an Epistemic Stability Index. Finally, we complement accuracy with 2PL IRT analyses, Wright maps, and a normalized task-discrimination measure based on Fisher information.