The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Abstract

Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific and yield limited overall performance. To pursue a more general solution, this paper focuses on a broader class of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine. In doing so, it linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and introduces an explicit memory fetch mechanism to reduce the difficulty of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves both the length generalization ability and the overall performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts of the Turing Machine, rather than the thinking styles, are indispensable for TAIL's length generalization; with them, the model exhibits read-and-write behaviors in its attention layers that are consistent with the properties of the Turing Machine. This work provides a promising direction for future research on learning LLM reasoning from synthetic data.
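To make "atomic states" and "explicit memory fetch" a little more concrete, here is a minimal toy sketch of a program that emits a Turing-Machine-style CoT trace for multi-digit addition. It is not the TAIL data format or pipeline from the paper: the task choice, the function name `addition_trace`, and the textual state format are hypothetical, chosen only for illustration. Each state re-reads (fetches) its operands at the current head position before writing one output digit, so no state depends on recalling a value from far back in the trace.

```python
from typing import List


def addition_trace(a: str, b: str) -> List[str]:
    """Emit a toy Turing-Machine-style CoT trace for adding two decimal strings.

    Hypothetical format (not the paper's): one atomic state per digit
    position, each starting with an explicit fetch of the operands.
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad so both "tapes" align
    out: List[str] = []  # output tape, least-significant digit first
    steps: List[str] = []
    carry = 0
    for pos in range(width - 1, -1, -1):  # head moves right to left
        da, db = int(a[pos]), int(b[pos])  # explicit memory fetch
        total = da + db + carry
        digit, new_carry = total % 10, total // 10
        out.append(str(digit))  # write one symbol per state
        steps.append(
            f"state {len(steps)}: fetch a[{pos}]={da}, b[{pos}]={db}, carry={carry}; "
            f"write {digit}; carry<-{new_carry}; move left"
        )
        carry = new_carry
    if carry:  # flush the final carry before halting
        out.append(str(carry))
        steps.append(f"state {len(steps)}: write {carry}; halt")
    else:
        steps.append(f"state {len(steps)}: halt")
    steps.append("answer: " + "".join(reversed(out)))
    return steps


if __name__ == "__main__":
    for line in addition_trace("957", "68"):
        print(line)
```

Because each digit position yields exactly one state, the trace grows linearly with input length (here, `957 + 68` produces three digit states, a final write-and-halt state, and the answer line `answer: 1025`). This mirrors, in a very simplified form, the abstract's idea of expanding reasoning into small, uniform steps rather than letting the model jump to a result.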
