在最小形式化下使用證明對大型語言模型的推理能力進行壓力測試

摘要

我們介紹了 ProofGrid，這是一個基準套件，用於透過機器可驗證的證明（而非僅依賴最終答案）來評估大型語言模型的推理能力。ProofGrid 包含 15 項任務，涵蓋證明撰寫、證明檢查、證明遮蔽與證明填空。所有任務均以最小化的形式化符號表達，尤其是 NDL（一種緊湊的自然演繹語言），它適用於簡短提示，並支援精確、可審計的驗證。這實現了機械性、可重複且細粒度的評估，而非依賴人類或大型語言模型進行判斷。ProofGrid 涵蓋了一個經過校準的難度光譜，從基礎推理測試到結構豐富的挑戰性任務（目前尚無模型能解決），同時將對領域知識、求解器委派及長上下文偽影的依賴降至最低。我們也發展了一個用於推理基準比較的框架，並藉此將 ProofGrid 定位於現有相關工作之中，從表徵方式、驗證保證與推理深度等面向進行比較。在方法論上，我們引入了一套經過儀器化設計的證明驗證流程，該流程能容忍微小的表面偏差，同時定位第一個實質性的推理失敗點，從而改善測量解析度，並將證明規劃與低階執行雜訊分離。利用此流程，我們評估了廣泛的開放源碼及專有模型。結果顯示進展迅速，但仍存在重大限制：前沿模型在若干基礎任務上表現良好，然而困難任務（尤其是那些需要全域組合推理或低階證明合成的任務）仍遠未解決。我們也發現了認知不穩定性現象——模型會產生有缺陷的證明，卻能正確拒絕那些局部推理，並以認知穩定性指標加以形式化。最後，我們以 2PL IRT 分析、Wright 圖，以及基於 Fisher 資訊的標準化任務區分度量來補充準確率評估。

English

We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in terms of representation, verification guarantees, and reasoning depth. Methodologically, we introduce an instrumented proof-checking pipeline that tolerates minor surface deviations while locating the first substantive reasoning failure, improving measurement resolution and separating proof planning from low-level execution noise. Using this pipeline, we evaluate a broad range of open and proprietary models. Results show rapid progress but substantial remaining limits: frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved. We also identify epistemic instability, where models generate flawed proofs yet correctly reject those local inferences in isolation, and formalize this with an Epistemic Stability Index. Finally, we complement accuracy with 2PL IRT analyses, Wright maps, and a normalized task-discrimination measure based on Fisher information.