
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?

April 1, 2025
Authors: Kai Yan, Yufei Xu, Zhengyin Du, Xuesong Yao, Zheyu Wang, Xiaowen Guo, Jiecao Chen
cs.AI

Abstract

The rapid escalation in difficulty of LLM benchmarks in recent years, from elementary school-level to frontier problems, has woven a miraculous illusion for researchers that we are only inches away from surpassing human intelligence. However, does the LLMs' remarkable reasoning ability indeed reflect true intelligence by human standards, or are the models simply reciting solutions witnessed during Internet-scale training? To study this problem, we propose RoR-Bench, a novel multimodal benchmark for detecting LLMs' recitation behavior when they are asked simple reasoning problems whose conditions have been subtly shifted, and we conduct an empirical analysis on this benchmark. Surprisingly, we find that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior; by changing a single phrase in the condition, top models such as OpenAI-o1 and DeepSeek-R1 can suffer a 60% performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community, compelling us to re-evaluate the true intelligence level of cutting-edge LLMs.
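
To make the evaluation idea concrete, below is a minimal Python sketch of how one might measure such a recitation-induced performance drop: score a model on each original problem and on its condition-shifted counterpart, then report the relative accuracy loss. The pair format, the `ask_model` callable, and the toy problem pair are hypothetical illustrations for this sketch, not the authors' actual data schema or evaluation pipeline.

```python
# Minimal sketch of a recitation check in the spirit of RoR-Bench:
# compare a model's accuracy on original problems against the same
# problems with one condition subtly shifted, then report the drop.
# The pair format and `ask_model` callable are hypothetical placeholders.

from typing import Callable, Dict, List


def accuracy(pairs: List[Dict], ask_model: Callable[[str], str], key: str) -> float:
    """Fraction of problems answered correctly for the given variant ('original' or 'shifted')."""
    correct = 0
    for p in pairs:
        answer = ask_model(p[f"{key}_question"]).strip()
        correct += int(answer == p[f"{key}_answer"])
    return correct / len(pairs)


def performance_drop(pairs: List[Dict], ask_model: Callable[[str], str]) -> float:
    """Relative accuracy loss when moving from original to condition-shifted problems."""
    acc_orig = accuracy(pairs, ask_model, "original")
    acc_shift = accuracy(pairs, ask_model, "shifted")
    return (acc_orig - acc_shift) / max(acc_orig, 1e-9)


if __name__ == "__main__":
    # Toy, illustrative problem pair: the shifted condition changes the answer.
    pairs = [{
        "original_question": "A 100 m train crosses a 100 m bridge at 10 m/s. "
                             "How many seconds until it has fully crossed?",
        "original_answer": "20",
        "shifted_question": "A 100 m train passes through a 100 m tunnel at 10 m/s. "
                            "For how many seconds is the train fully inside the tunnel?",
        "shifted_answer": "0",
    }]

    # A model that merely recites the memorized solution fails the shifted variant.
    reciting_model = lambda question: "20"
    print(f"Performance drop: {performance_drop(pairs, reciting_model):.0%}")
```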