LLMEval-Logic：一個經求解器驗證的、具備對抗性強化的中文大型語言模型邏輯推理基準

摘要

評估大型語言模型（LLMs）在自然語言邏輯推理上的表現至關重要，因為基於規則的任務要求結論必須嚴格遵循給定的前提。許多現有的邏輯推理基準測試是透過從取樣公式中模板化自然語言項目來生成的，它們僅提供粗略或未經審核的形式化標註，且目前已迅速被前沿推理模型飽和。我們提出LLMEval-Logic，一個基於真實情境場景的中文邏輯推理基準。其流程由正向作者與專家審核共同產生自然語言項目及其參考形式化表述，利用Z3驗證註解答案，為自然語言到形式語言的評分建構專家評分量規，並透過封閉循環的對抗性工作流程強化選定項目。該基準以兩組配對子集發布：包含246個項目的基礎子集，附有1,400個專家開發的評分量規原子；以及包含190個項目的困難子集，附有938個跨越封閉模型空間的多步驟子問題。在LLMEval-Logic上評估14個前沿LLM，揭示了當前模型的顯著差距：最佳模型僅達到37.5%的困難項目準確率，即便提供參考符號，所有受評模型中最高的Z3+評分量規聯合形式化分數也僅達60.16%。我們的基準公開於 https://github.com/llmeval/LLMEval-Logic。

English

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.