Generative Evaluation of Complex Reasoning in Large Language Models

Abstract

With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once they are incorporated into subsequent LLM training sets, which undermines their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against those of university students. Our findings reveal that many LLMs surpass university-level performance on easy reasoning tasks, and that reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.
