Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits

Abstract

Tokenization is the first, and often underappreciated, layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs. This work presents a theoretical and empirical investigation into how tokenization schemes, particularly subword-based methods like byte-pair encoding (BPE), impede symbolic computation by merging or obscuring atomic reasoning units. We introduce the notion of Token Awareness to formalize how poor token granularity disrupts logical alignment and prevents models from generalizing symbolic procedures. Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate that token structure dramatically affects reasoning performance, causing failure even with CoT, while atomically aligned formats unlock strong generalization, allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g., o1) in structured reasoning. Our findings reveal that symbolic reasoning ability in LLMs is not purely architectural, but deeply conditioned on token-level representations.
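To make the idea of merged atomic units concrete, the following is a minimal sketch and not the tokenizer or evaluation setup used in this work. It assumes a hypothetical toy vocabulary (`TOY_VOCAB`) and uses greedy longest-match segmentation as a simple stand-in for learned BPE merges; the helper name `greedy_tokenize` is likewise illustrative. It contrasts a number written contiguously, where digits are fused into multi-digit tokens, with a digit-separated format in which every digit remains its own token.

```python
def greedy_tokenize(text, vocab):
    """Segment text by greedy longest-match against vocab.

    This is a simplified stand-in for BPE, not a real BPE implementation.
    Any single character is always accepted as a fallback token.
    """
    max_len = max(len(token) for token in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary: single digits plus a few merged digit chunks.
TOY_VOCAB = {"123", "45", "12", "34", "+", " "} | set("0123456789")

# Contiguous digits are fused into multi-digit tokens.
merged = greedy_tokenize("12345", TOY_VOCAB)

# Space-separated digits stay atomic (separator tokens dropped for display).
atomic = [t for t in greedy_tokenize("1 2 3 4 5", TOY_VOCAB) if t != " "]

print(merged)  # ['123', '45']
print(atomic)  # ['1', '2', '3', '4', '5']
```

In the merged case, the digit boundaries that a column-wise procedure such as addition with carries operates on are hidden inside the tokens "123" and "45"; in the separated case, each digit is individually addressable. Real subword vocabularies differ in which chunks they merge, so the exact split shown here is only illustrative.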
