LLMEval-Logic: A Solver-Verified Chinese Benchmark for LLM Logical Reasoning with Adversarial Hardening

Abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
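
To make the solver-verification step concrete, the following is a minimal sketch of an entailment check over a closed, finite model space. It is not the benchmark's pipeline: the abstract states that annotated answers are verified with Z3, whereas this sketch substitutes a brute-force truth-table search using only the Python standard library. The function `entails`, the toy item, and the variable names are all hypothetical and chosen purely for illustration.

```python
from itertools import product

def entails(variables, premises, conclusion):
    """Return True if every assignment that satisfies all premises
    also satisfies the conclusion (brute-force stand-in for a solver)."""
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False  # counterexample: premises hold, conclusion fails
    return True

# Hypothetical toy item (not taken from the benchmark):
# "If it rains, the match is cancelled. It rains."
variables = ["rain", "cancelled"]
premises = [
    lambda m: (not m["rain"]) or m["cancelled"],  # rain -> cancelled
    lambda m: m["rain"],                          # rain
]

# The annotated answer "the match is cancelled" follows from both premises.
answer_follows = entails(variables, premises, lambda m: m["cancelled"])

# With only the conditional premise, the same answer is not entailed.
answer_unsupported = entails(variables, premises[:1], lambda m: m["cancelled"])
```

A solver performs the same check by asserting the premises together with the negated conclusion and testing for unsatisfiability, which scales beyond what exhaustive enumeration can handle. Note that under this definition an inconsistent premise set entails every conclusion, so a real verification step would presumably also check that the premises are satisfiable.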