mSCoRe: A Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning

Abstract

Recent advancements in reasoning-reinforced Large Language Models (LLMs) have shown remarkable capabilities in complex reasoning tasks. However, the mechanism underlying their use of different human reasoning skills remains poorly investigated, especially for multilingual commonsense reasoning, which involves everyday knowledge across different languages and cultures. To address this gap, we propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components designed to systematically evaluate LLMs' reasoning capabilities: (1) a novel taxonomy of reasoning skills that enables fine-grained analysis of models' reasoning processes, (2) a robust data synthesis pipeline tailored specifically for commonsense reasoning evaluation, and (3) a complexity scaling framework that allows task difficulty to scale dynamically alongside future improvements in LLM abilities. Extensive experiments on eight state-of-the-art LLMs of varying sizes and training approaches demonstrate that mSCoRe remains significantly challenging for current models, particularly at higher complexity levels. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense. We further provide a detailed analysis of the models' reasoning processes, suggesting future directions for improving multilingual commonsense reasoning capabilities.
