最小限の形式主義下での証明によるLLMの推論能力のストレステスト

要旨

我々は、最終回答のみではなく機械検証可能な証明を通じてLLMの推論を評価するためのベンチマークスイートであるProofGridを紹介する。ProofGridは、証明作成、証明検証、証明マスキング、証明ギャップ埋めにわたる15のタスクを含む。タスクは最小限の形式的記法、特に短いプロンプトに収まり、精密で監査可能な検証をサポートするコンパクトな自然演繹言語であるNDLで表現される。これにより、人間やLLMによる判断ではなく、機械的で再現可能かつきめ細かな評価が実現される。ProofGridは、基礎的な推論テストから現在のモデルでは解けない構造的に豊かなチャレンジタスクまで、校正された難易度のスペクトラムをカバーし、ドメイン知識、ソルバー委任、長文脈アーティファクトへの依存を最小限に抑えている。また、推論ベンチマークの比較フレームワークを開発し、それを用いてProofGridを表現、検証保証、推論深度の観点から既存研究との関連で位置づける。方法論的には、軽微な表面的な逸脱を許容しつつ最初の実質的な推論失敗を特定し、測定分解能を向上させ、証明計画と低レベルの実行ノイズを分離する、計装化された証明検証パイプラインを導入する。このパイプラインを用いて、幅広いオープンおよびプロプライエタリモデルを評価する。結果は急速な進歩を示すが、依然として大きな限界がある：フロンティアモデルはいくつかの基礎的タスクでは良好に機能するが、困難なタスク、特に大域的組合せ推論や低レベルの証明合成を必要とするタスクは、解決にはほど遠い。また、認識的不安定性—モデルが欠陥のある証明を生成する一方で、それらの局所的な推論を単独では正しく拒否する現象—を特定し、これを認識的安定性指標として形式化する。最後に、精度を2パラメータ項目応答理論分析、ライトマップ、およびフィッシャー情報量に基づく正規化タスク弁別尺度で補完する。

English

We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in terms of representation, verification guarantees, and reasoning depth. Methodologically, we introduce an instrumented proof-checking pipeline that tolerates minor surface deviations while locating the first substantive reasoning failure, improving measurement resolution and separating proof planning from low-level execution noise. Using this pipeline, we evaluate a broad range of open and proprietary models. Results show rapid progress but substantial remaining limits: frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved. We also identify epistemic instability, where models generate flawed proofs yet correctly reject those local inferences in isolation, and formalize this with an Epistemic Stability Index. Finally, we complement accuracy with 2PL IRT analyses, Wright maps, and a normalized task-discrimination measure based on Fisher information.