General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks
April 13, 2026
Authors: Junlin Liu, Shengnan An, Shuang Zhou, Dan Ma, Shixiong Luo, Ying Xie, Yuan Zhang, Wenling Yuan, Yifan Zhou, Xiaoyu Li, Ziwen Wang, Xuezhi Cao, Xunliang Cai
cs.AI
Abstract
Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains such as mathematics and physics. However, their ability to transfer these reasoning skills to broader, more general contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge yet still poses formidable challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to the K-12 level, General365 explicitly decouples reasoning ability from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations of 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performance of LLMs on math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose reasoning in real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io