InductionBench: LLMs Fail in the Simplest Complexity Class

February 20, 2025
作者: Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang
cs.AI

Abstract

Large language models (LLMs) have shown remarkable improvements in reasoning, and many existing benchmarks have been addressed, fully or partially, by models such as o1 and o3. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Code and data are available at https://github.com/Wenyueh/inductive_reasoning_benchmark.
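
To make the inductive setting concrete, here is a minimal sketch, assuming a hypothetical target rule whose output depends only on a two-symbol window of the input (a function of roughly the lowest subregular complexity). The benchmark's actual task formats and data are defined in the linked repository; the rule and strings below are illustrative only.

```python
# Toy illustration (not the authors' actual task format or data) of the kind
# of problem InductionBench poses: a hidden string-to-string function must be
# induced purely from observed input/output pairs.

def hidden_rule(s: str) -> str:
    """Hypothetical target function: rewrite 'a' as 'b' whenever the
    preceding input symbol is 'b'. Rules that depend only on a bounded
    window of the input sit near the bottom of the subregular hierarchy
    of functions."""
    out, prev = [], ""
    for ch in s:
        out.append("b" if ch == "a" and prev == "b" else ch)
        prev = ch  # context is read from the original input, not the output
    return "".join(out)

# The solver is shown only these pairs and must state the underlying rule,
# then apply it correctly to unseen strings.
observations = [(s, hidden_rule(s)) for s in ["aba", "baa", "abab", "bbaa"]]
for inp, outp in observations:
    print(f"{inp} -> {outp}")
# aba -> abb, baa -> bba, abab -> abbb, bbaa -> bbba
```

Deductive benchmarks hand the model the rule and ask for its consequences; here the direction is reversed, and the model must recover a rule consistent with every observed pair.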
