ChLogic: Evaluating the Robustness of Logical Reasoning in Chinese Expression

Abstract

Large language models perform increasingly well on standardized logical reasoning benchmarks, but it remains unclear whether this ability is robust in languages other than English. We introduce ChLogic, an English-Chinese aligned benchmark that tests whether models preserve their logical reasoning performance when the same latent logical structure is expressed in English and in diverse Chinese surface realizations. Built from formal logical templates, the benchmark contains three datasets: (i) the General aligned set, derived from 60 General Propositions across nine template families; (ii) the Difficult aligned set, derived from 40 Difficult Problems; and (iii) the Chinese-only set, covering 15 language-specific phenomenon types. Each aligned item pairs one English reference expression with five Chinese realizations. Experiments on Qwen3, Ministral, and GLM models reveal a persistent English-Chinese performance gap. Back-translating standard Chinese into English often improves performance on the General aligned set but has mixed effects on the Difficult aligned set, where Qwen3-32B and GLM-5.1 perform worse after back-translation. These results indicate that Chinese surface realization, translation artifacts, and model-specific behavior jointly affect multilingual logical reasoning. Overall, ChLogic provides a useful stress test for the robustness of multilingual reasoning.
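
To make the aligned-item structure and the English-Chinese gap concrete, the following is a minimal Python sketch. It is not the released data format or evaluation code: the `AlignedItem` schema, its field names, the `accuracy` and `language_gap` helpers, and the choice to pool all Chinese realizations into one accuracy figure are assumptions made purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedItem:
    """One aligned item (hypothetical schema, not the released format)."""

    item_id: str
    english: str              # the single English reference expression
    chinese: tuple[str, ...]  # the five Chinese surface realizations
    label: str                # gold answer shared by every realization

    def __post_init__(self) -> None:
        # The abstract states that each aligned item has five Chinese realizations.
        if len(self.chinese) != 5:
            raise ValueError("expected exactly five Chinese realizations")


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def language_gap(
    items: list[AlignedItem],
    en_preds: dict[str, str],
    zh_preds: dict[str, list[str]],
) -> float:
    """English accuracy minus Chinese accuracy pooled over all realizations.

    Pooling is an assumed aggregation; per-realization reporting is equally valid.
    """
    en_acc = accuracy(
        [en_preds[it.item_id] for it in items],
        [it.label for it in items],
    )
    zh_flat = [p for it in items for p in zh_preds[it.item_id]]
    gold_flat = [it.label for it in items for _ in it.chinese]
    return en_acc - accuracy(zh_flat, gold_flat)
```

A positive return value means the model answered the English reference expressions more accurately than their Chinese realizations. The same function could be applied to back-translated inputs by passing their predictions in place of `en_preds`.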