LLMEval-Logic：一个通过求解器验证的、带有对抗性加固的中文大语言模型逻辑推理基准

摘要

对自然语言逻辑推理能力进行评估对大型语言模型来说是至关重要的，因为规则性任务要求结论必须严格遵循给出的前提。许多现有的逻辑推理基准测试通过从采样公式中模板化生成自然语言题目，仅提供粗糙或未经审核的形式化标注，且如今已被前沿推理模型快速饱和。我们推出了LLMEval-Logic，这是一个基于真实场景情境构建的中文逻辑推理基准测试。其流程包括：前置作者与专家审核自然语言题目及其参考形式化表示，使用Z3验证标注答案，构建用于自然语言到形式化评分的专家评分标准，并通过闭环对抗式工作流程对选定题目进行加固。该基准测试以两个配对子集的形式发布：包含246道题目的基础子集，附带1,400个专家开发的评分标准原子；以及190道题目的困难子集，包含938道针对封闭模型空间的多步子问题。在LLMEval-Logic上对14个前沿大语言模型进行评估，揭示了当前模型间的显著差距：最佳模型在困难题目上的准确率仅为37.5%，即便在提供参考符号的情况下，已评估模型中最高联合Z3+评分标准形式化得分也仅达到60.16%。我们的基准测试已在 https://github.com/llmeval/LLMEval-Logic 公开发布。

English

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.