Not All LLM Reasoners Are Created Equal
October 2, 2024
Authors: Arian Hosseini, Alessandro Sordoni, Daniel Toyama, Aaron Courville, Rishabh Agarwal
cs.AI
Abstract
We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems presented together, so that the answer to the second problem depends on correctly answering the first. Our findings reveal a significant reasoning gap in most LLMs: a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not due to test-set leakage, but to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.
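To make the evaluation protocol concrete, below is a minimal Python sketch of how such a reasoning gap could be measured. The `solve` callable, the `{X}` placeholder in the pair format, and the chaining prompt are all illustrative assumptions, not the authors' released benchmark code or data.

```python
def reasoning_gap(pairs, solve):
    """Return independent accuracy minus compositional accuracy.

    Each pair is (q1, a1, q2_template, a2), where q2_template marks with
    the placeholder {X} the quantity that depends on the answer to q1.
    (This pair format is an assumption made for illustration.)
    """
    n = len(pairs)
    independent = compositional = 0
    for q1, a1, q2_template, a2 in pairs:
        # Independent setting: solve each question on its own, substituting
        # the ground-truth answer to q1 into q2.
        ok1 = solve(q1) == a1
        ok2 = solve(q2_template.format(X=a1)) == a2
        independent += int(ok1 and ok2)

        # Compositional setting: the model sees both questions at once and
        # must carry its own answer to q1 into q2 (the "second hop").
        chained = (q1 + "\nLet X be the answer to the question above.\n"
                   + q2_template.format(X="X"))
        compositional += int(solve(chained) == a2)

    # A positive gap means performance drops when the problems are composed.
    return independent / n - compositional / n


def toy_solver(prompt: str) -> int:
    # Stand-in for an LLM call that extracts a final numeric answer;
    # it always returns 12 just so the example runs end to end.
    return 12


pairs = [(
    "Lily has 3 boxes with 4 apples each. How many apples does she have?",
    12,
    "A crate holds {X} apples. How many apples are in 2 crates?",
    24,
)]
print(reasoning_gap(pairs, toy_solver))
```

In this sketch, the independent setting scores both questions with the true first answer substituted in, while the compositional setting forces the model to propagate its own intermediate answer, which is the failure mode the abstract attributes to distraction and poor second-hop reasoning.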