Not All LLM Reasoners Are Created Equal

Abstract

We study the depth of the grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems, chained so that the answer to the second problem depends on correctly answering the first. Our findings reveal a significant reasoning gap in most LLMs, that is, a difference in performance between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps stem not from test-set leakage but from distraction by the additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.
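As a rough illustration of how such a reasoning gap might be quantified, the sketch below compares the observed accuracy on chained pairs with the accuracy one would expect if the two questions were solved independently. The product baseline, the function name `reasoning_gap`, and the sample accuracies are illustrative assumptions; the abstract itself does not state a formula.

```python
def reasoning_gap(acc_q1: float, acc_q2: float, acc_pair: float) -> float:
    """Return observed pair accuracy minus the expected pair accuracy.

    Assumption (not stated in the abstract): if both questions were
    solved independently, the chance of getting the whole pair right
    is the product of the two per-question accuracies.
    """
    for value in (acc_q1, acc_q2, acc_pair):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie in [0, 1]")
    expected_pair = acc_q1 * acc_q2
    return acc_pair - expected_pair


# Hypothetical numbers for illustration only, not results from the paper.
gap = reasoning_gap(acc_q1=0.9, acc_q2=0.8, acc_pair=0.6)
print(f"reasoning gap: {gap:+.2f}")
```

Under this convention, a negative value means the model does worse on chained pairs than its standalone accuracies would suggest, and a value near zero means no gap.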
