mSCoRe: A Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

Abstract

Recent advances in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, how these models draw on different human reasoning skills remains underexplored, especially for multilingual commonsense reasoning, which involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components designed to systematically evaluate LLMs' reasoning capabilities: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework that allows task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide a detailed analysis of the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.
