Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?

Abstract

The rapid escalation in LLM benchmark difficulty in recent years, from elementary school-level to frontier problems, has given researchers the impression that we are only inches away from surpassing human intelligence. However, does the remarkable reasoning ability of LLMs indeed come from true intelligence by human standards, or are they simply reciting solutions witnessed during training at Internet scale? To study this problem, we propose RoR-Bench, a novel, multi-modal benchmark for detecting LLMs' recitation behavior when they are asked simple reasoning problems with subtly shifted conditions, and we conduct an empirical analysis on this benchmark. Surprisingly, we found that existing cutting-edge LLMs unanimously exhibit extremely severe recitation behavior: changing a single phrase in the condition can cause top models such as OpenAI-o1 and DeepSeek-R1 to suffer a 60% performance loss on elementary school-level arithmetic and reasoning problems. Such findings are a wake-up call to the LLM community and compel us to re-evaluate the true intelligence level of cutting-edge LLMs.
