ChLogic: Evaluating the Robustness of Logical Reasoning in Chinese Expression

Abstract

Large language models perform increasingly well on standardized logical reasoning benchmarks, but it is unclear whether this ability holds up in languages other than English. We introduce ChLogic, an English-Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same latent logical structure is expressed in English and in diverse Chinese surface realizations. The benchmark is built from formal logical templates and contains three datasets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English-Chinese performance gap. Back-translating standard Chinese into English often improves performance on the General aligned set, but has mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.
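
To make the aligned-item structure and the English-Chinese gap concrete, the following is a minimal Python sketch and not the benchmark's actual data format or scoring code. The names `AlignedItem`, `accuracy`, and `language_gap` are hypothetical, as is the boolean `label` field: the answer format is not specified here. The gap is assumed to be English accuracy minus accuracy pooled over all Chinese realizations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlignedItem:
    """One aligned item: an English reference plus its Chinese realizations."""
    item_id: str
    english: str
    chinese: List[str]  # five realizations per item, per the benchmark description
    label: bool         # hypothetical gold answer; the real answer format may differ


def accuracy(predictions: List[bool], golds: List[bool]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        raise ValueError("cannot compute accuracy on an empty list")
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def language_gap(
    items: List[AlignedItem],
    en_preds: Dict[str, bool],
    zh_preds: Dict[str, List[bool]],
) -> float:
    """English accuracy minus accuracy pooled over all Chinese realizations.

    This pooling is an assumption made for illustration; a per-realization
    breakdown would be an equally reasonable choice.
    """
    golds = [item.label for item in items]
    en_acc = accuracy([en_preds[item.item_id] for item in items], golds)

    zh_flat_preds: List[bool] = []
    zh_flat_golds: List[bool] = []
    for item in items:
        for pred in zh_preds[item.item_id]:
            zh_flat_preds.append(pred)
            zh_flat_golds.append(item.label)
    zh_acc = accuracy(zh_flat_preds, zh_flat_golds)

    return en_acc - zh_acc
```

A positive return value means the model answered the English references more accurately than the Chinese realizations of the same logical structure.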