General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Abstract

Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains such as mathematics and physics. However, their ability to generalize these reasoning skills to broader contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performance of LLMs on math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io
