InductionBench: LLMs Fail in the Simplest Complexity Class

Abstract

Large language models (LLMs) have shown remarkable improvements in reasoning, and models such as o1 and o3 have fully or partially solved many existing benchmarks. However, most of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined and the model plans and applies those rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Code and data are available at https://github.com/Wenyueh/inductive_reasoning_benchmark.