Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression

Abstract

Recent thinking models solve complex reasoning tasks by scaling test-time compute, but this scaling must be allocated in line with task difficulty. On the one hand, short reasoning (underthinking) leads to errors on harder problems that require extended reasoning steps; on the other hand, excessively long reasoning (overthinking) is token-inefficient, generating unnecessary steps even after a correct intermediate solution has been reached. We refer to this failure to modulate response length across problems of varying difficulty as under-adaptivity. To address under-adaptivity and strike a balance between under- and overthinking, we propose TRAAC (Think Right with Adaptive, Attentive Compression), an online post-training reinforcement learning (RL) method that leverages the model's self-attention over a long reasoning trajectory to identify important steps and prune redundant ones. TRAAC also estimates difficulty and incorporates it into training rewards, thereby learning to allocate a reasoning budget commensurate with example difficulty. Compared to base models and other RL baselines, our approach improves accuracy, reduces reasoning steps, and enables adaptive thinking. Across a variety of tasks (AIME, AMC, GPQA-D, BBEH), TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline. TRAAC also shows strong generalization: although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets such as GPQA-D, BBEH, and OptimalThinkingBench. Our analysis further verifies that TRAAC provides fine-grained adjustments to the thinking budget based on difficulty, and that the combination of task-difficulty calibration and attention-based compression yields gains across diverse tasks.
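The two ideas named in the abstract, attention-based pruning of reasoning steps and a difficulty-dependent reasoning budget, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the use of the rollout failure rate as a difficulty proxy, the linear mapping from difficulty to the fraction of steps kept, and the premise that one aggregated attention score per reasoning step is already available. The sketch does not cover how difficulty enters the training reward.

```python
import math
from typing import List, Sequence


def estimate_difficulty(rollout_correct: Sequence[bool]) -> float:
    """Hypothetical difficulty proxy: fraction of sampled rollouts that fail."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return 1.0 - sum(rollout_correct) / len(rollout_correct)


def keep_ratio(difficulty: float, min_keep: float = 0.5, max_keep: float = 1.0) -> float:
    """Assumed linear schedule: harder problems keep a larger share of steps."""
    d = min(max(difficulty, 0.0), 1.0)  # clamp difficulty to [0, 1]
    return min_keep + (max_keep - min_keep) * d


def compress_steps(
    steps: Sequence[str],
    attention_scores: Sequence[float],
    difficulty: float,
) -> List[str]:
    """Keep the highest-attention steps, preserving their original order.

    attention_scores[i] is assumed to be a single aggregated score for
    steps[i] (for example, attention mass averaged over heads and layers).
    """
    if len(steps) != len(attention_scores):
        raise ValueError("one attention score is required per step")
    if not steps:
        return []
    n_keep = max(1, math.ceil(keep_ratio(difficulty) * len(steps)))
    # Rank step indices by score, highest first, and take the top n_keep.
    ranked = sorted(range(len(steps)), key=lambda i: attention_scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore chronological order
    return [steps[i] for i in kept]


if __name__ == "__main__":
    steps = ["set up equation", "solve for x", "re-check setup", "verify answer"]
    scores = [0.10, 0.70, 0.05, 0.15]
    easy = estimate_difficulty([True, True, True, True])
    print(compress_steps(steps, scores, easy))
```

In this sketch an easy problem (all rollouts correct) keeps only the top half of the steps, while a problem that every rollout fails keeps the full trajectory, which mirrors the abstract's goal of allocating a reasoning budget commensurate with difficulty.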