General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks

Abstract

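Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains such as mathematics and physics. However, their ability to generalize these reasoning skills to broader contexts, often termed general reasoning, remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still poses formidable challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations of 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performance of LLMs on math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. Code, Dataset, and Leaderboard: https://general365.github.io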


