최소 형식주의 하의 증명을 통한 LLM 추론 능력 스트레스 테스트

초록

ProofGrid는 최종 답변만이 아닌 기계적으로 검증 가능한 증명을 통해 LLM 추론을 평가하기 위한 벤치마크 제품군을 소개한다. ProofGrid는 증명 작성, 증명 검증, 증명 마스킹, 증명 공백 채우기를 포함한 15개의 과제로 구성된다. 과제는 최소한의 형식적 표기법, 특히 짧은 프롬프트에 적합하며 정밀하고 감사 가능한 검증을 지원하는 간결한 자연추론 언어인 NDL로 표현된다. 이는 인간이나 LLM의 판단이 아닌 기계적이고 재현 가능하며 세분화된 평가를 가능하게 한다. ProofGrid는 기초적 추론 테스트부터 현재 어떤 모델도 해결하지 못하는 구조적으로 풍부한 도전 과제에 이르기까지 보정된 난이도 스펙트럼을 포괄하며, 도메인 지식, 솔버 위임, 장문맥 인공물에 대한 의존도를 최소화한다. 또한, 추론 벤치마크 비교 프레임워크를 개발하여 ProofGrid를 표현, 검증 보장, 추론 깊이 측면에서 기존 연구와 비교한다. 방법론적으로, 우리는 미미한 표면적 편차를 허용하면서 첫 번째 실질적 추론 실패 지점을 식별하고, 측정 해상도를 개선하며 증명 계획과 하위 수준 실행 잡음을 분리하는 계측된 증명 검증 파이프라인을 도입한다. 이 파이프라인을 사용하여 광범위한 오픈 및 독점 모델을 평가한다. 결과는 빠른 진전을 보여주지만 상당한 한계가 여전히 존재함을 나타낸다: 최첨단 모델은 여러 기초 과제에서 우수한 성능을 보이나, 특히 전역적 조합 추론이나 하위 수준 증명 합성을 요구하는 어려운 과제는 아직 해결에 크게 미치지 못한다. 또한, 모델이 결함이 있는 증명을 생성하면서도 해당 국부적 추론을 개별적으로는 올바르게 거부하는 인식론적 불안정성을 확인하고, 이를 인식론적 안정성 지수로 공식화한다. 마지막으로, 정확도를 2PL IRT 분석, Wright 맵, Fisher 정보에 기반한 정규화된 과제 변별도로 보완한다.

English

We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in terms of representation, verification guarantees, and reasoning depth. Methodologically, we introduce an instrumented proof-checking pipeline that tolerates minor surface deviations while locating the first substantive reasoning failure, improving measurement resolution and separating proof planning from low-level execution noise. Using this pipeline, we evaluate a broad range of open and proprietary models. Results show rapid progress but substantial remaining limits: frontier models perform well on several foundational tasks, yet difficult tasks, especially those requiring global combinatorial reasoning or low-level proof synthesis, remain far from solved. We also identify epistemic instability, where models generate flawed proofs yet correctly reject those local inferences in isolation, and formalize this with an Epistemic Stability Index. Finally, we complement accuracy with 2PL IRT analyses, Wright maps, and a normalized task-discrimination measure based on Fisher information.