Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Abstract

Despite high benchmark scores, Large Language Models (LLMs) often fail on simple problems, raising a critical question: do LLMs learn mathematical principles, or do they merely memorize patterns? Rather than designing increasingly complex benchmarks, as recent work does, we investigate this question using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8-99.8% accuracy on numerical addition, performance collapses to $\leq 7.5\%$ under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this conclusion. Explicitly providing addition rules degrades performance by 81.2% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
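The two probes named in the abstract, isomorphic symbolic mapping and commutativity, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names (`make_mapping`, `encode`, `decode`, `symbolic_sum`, `commutativity_violations`), the choice of lowercase letters as symbols, and the `solver` callable standing in for a model are all assumptions made for illustration.

```python
import random
import string

DIGITS = "0123456789"


def make_mapping(seed=0):
    """Build a random one-to-one map from decimal digits to letters.

    The symbol alphabet (lowercase letters) is an illustrative assumption.
    """
    rng = random.Random(seed)
    symbols = rng.sample(string.ascii_lowercase, len(DIGITS))
    return dict(zip(DIGITS, symbols))


def encode(n, mapping):
    """Rewrite a non-negative integer digit by digit, e.g. 7 -> 'y'."""
    return "".join(mapping[d] for d in str(n))


def decode(s, mapping):
    """Invert encode(): turn a symbol string back into an integer."""
    inverse = {sym: digit for digit, sym in mapping.items()}
    return int("".join(inverse[c] for c in s))


def symbolic_sum(a_sym, b_sym, mapping):
    """Reference answer for a symbolic addition problem.

    Because the mapping is a bijection, the symbolic task is isomorphic
    to ordinary addition: decode, add, then re-encode.
    """
    return encode(decode(a_sym, mapping) + decode(b_sym, mapping), mapping)


def commutativity_violations(solver, pairs):
    """Return the (a, b) pairs for which solver(a, b) != solver(b, a).

    `solver` is a hypothetical stand-in for a model's answer function.
    """
    return [(a, b) for a, b in pairs if solver(a, b) != solver(b, a)]
```

A model's symbolic answer could then be scored against `symbolic_sum`, and its answers to `A+B` and `B+A` compared with `commutativity_violations`. A system that had learned the addition rule itself would be expected to pass both checks regardless of which symbols are used.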
