The Markovian Thinker

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence, this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk that is sufficient for seamless continuation of reasoning after the reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks for up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate that, at an average thinking length of 96K tokens, LongCoT-RL costs 27 H100-months versus 7 for Delethink. Analysis at RL initialization shows that off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
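
The chunked environment described above can be illustrated with a toy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation. The names `markovian_rollout`, `counting_policy`, and `attended_positions` are hypothetical, as are the `EOS` marker and the policy interface. The rule that the carryover is simply the last few tokens of the previous chunk is also an assumption; the abstract only says the carryover is short. Token counts are scaled down from thousands to single digits so the behavior is easy to trace.

```python
from typing import Callable, List, Optional, Tuple

Token = int
# Hypothetical policy interface: (context, max_new_tokens) -> newly generated tokens.
Generate = Callable[[List[Token], int], List[Token]]

EOS: Token = -1  # hypothetical end-of-thinking marker


def markovian_rollout(
    prompt: List[Token],
    generate: Generate,
    chunk_size: int = 8,
    carryover: int = 4,
    budget: int = 24,
) -> Tuple[List[Token], int]:
    """Think for up to `budget` tokens; return (trace, largest context seen).

    The context is always prompt + carry + new tokens, where carry and new
    tokens together never exceed `chunk_size`.
    """
    assert 0 <= carryover < chunk_size, "each chunk must leave room for new tokens"
    trace: List[Token] = []
    carry: List[Token] = []
    max_context = 0
    while len(trace) < budget:
        context = prompt + carry  # context reset: earlier tokens are dropped
        room = min(chunk_size - len(carry), budget - len(trace))
        new = generate(context, room)
        trace.extend(new)
        max_context = max(max_context, len(context) + len(new))
        if not new or new[-1] == EOS:
            break
        # Assumption: the carryover is the tail of the chunk just finished.
        carry = (carry + new)[-carryover:] if carryover else []
    return trace, max_context


def counting_policy(context: List[Token], max_new: int) -> List[Token]:
    """Toy policy: keep counting upward from the last token in the context."""
    start = context[-1] + 1
    return list(range(start, start + max_new))


def attended_positions(
    prompt_len: int,
    n_tokens: int,
    chunk_size: Optional[int] = None,
    carryover: int = 0,
) -> int:
    """Count (new token, attended position) pairs, a rough proxy for attention compute.

    With chunk_size=None the context grows without bound (LongCoT-style).
    """
    total, ctx = 0, prompt_len
    for _ in range(n_tokens):
        if chunk_size is not None and ctx == prompt_len + chunk_size:
            ctx = prompt_len + carryover  # chunk boundary: reset the context
        ctx += 1
        total += ctx
    return total


if __name__ == "__main__":
    trace, max_context = markovian_rollout([0], counting_policy)
    print(len(trace), max_context)
    print(attended_positions(1, 24), attended_positions(1, 24, chunk_size=8, carryover=4))
```

In this sketch the toy policy can continue counting after every reset because the carryover contains everything it needs, which mirrors the idea of a learned textual state. The per-token cost in the chunked case is bounded by the prompt length plus the chunk size, so the pair count grows linearly with thinking length, whereas the unbounded case grows quadratically. The counts are a proxy only and are not meant to reproduce the H100-month estimates.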
