Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Abstract

Improving the reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
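As a rough illustration of the two mechanisms named in the abstract, the sketch below gates a second iteration on a per-token decider score and builds an attention mask over (position, iteration) pairs. It is not taken from the TaH repository. The function names, the 0.5 threshold, and the masking rule (a query may attend to a key at an earlier-or-equal position and an earlier-or-equal iteration) are assumptions made for illustration. The decider scores are given as plain numbers rather than produced by a neural decider.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (token position, iteration index starting at 1)


def select_depths(decider_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Hypothetical gate: iterate twice only where the score marks a token as hard."""
    return [2 if score >= threshold else 1 for score in decider_scores]


def duo_causal_mask(depths: List[int]) -> Tuple[List[Node], List[List[bool]]]:
    """Build a mask over all (position, iteration) nodes that actually exist.

    Assumed rule: query (qp, qi) may attend to key (kp, ki)
    when kp <= qp (causal in sequence) and ki <= qi (causal in depth).
    """
    nodes = [(pos, it) for pos, depth in enumerate(depths) for it in range(1, depth + 1)]
    mask = [[kp <= qp and ki <= qi for (kp, ki) in nodes] for (qp, qi) in nodes]
    return nodes, mask


# Three tokens; only the middle one is flagged as hard by the (made-up) scores.
depths = select_depths([0.1, 0.9, 0.2])
nodes, mask = duo_causal_mask(depths)
for node, row in zip(nodes, mask):
    print(node, ["x" if allowed else "." for allowed in row])
```

Under this assumed rule, a first-iteration query never sees second-iteration keys, so the first pass over the whole sequence can still be computed in parallel, while second-iteration queries see both depths of all earlier positions.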
