Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
November 11, 2025
Authors: Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang
cs.AI
Abstract
Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of immediately verbalizing an output, the last-layer hidden states are fed back as inputs for additional iterations that refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration-depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
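The selective iteration scheme described above can be sketched in a few lines. This is a hypothetical toy illustration, not the authors' implementation: the transformer pass is collapsed to a single linear layer, and the decider is a random linear probe with a sigmoid; the names (`forward`, `decide_hard`, `think_at_hard`) and all dimensions are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8

# Toy stand-ins for the model components (all hypothetical):
W_base = rng.normal(size=(HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)  # frozen backbone weight
A = rng.normal(size=(HIDDEN, 2)) * 0.1                        # LoRA down-projection
B = rng.normal(size=(2, HIDDEN)) * 0.1                        # LoRA up-projection
w_decider = rng.normal(size=HIDDEN)                           # lightweight neural decider

def forward(h, use_lora=False):
    """One transformer 'pass', collapsed to a single linear layer for illustration.
    With use_lora=True the low-rank delta A @ B shifts the objective toward
    hard-token refinement."""
    W = W_base + A @ B if use_lora else W_base
    return np.tanh(h @ W)

def decide_hard(h, threshold=0.5):
    """Decider flags tokens whose first-pass prediction is likely wrong."""
    return 1.0 / (1.0 + np.exp(-h @ w_decider)) > threshold

def think_at_hard(hidden_states):
    """Standard forward pass for every token; a second, LoRA-adapted latent
    iteration only where the decider fires. Returns refined states and the
    number of tokens that received an extra iteration."""
    refined, extra = [], 0
    for h in hidden_states:
        h1 = forward(h)                      # first, standard pass
        if decide_hard(h1):                  # hard token -> iterate deeper
            h1 = forward(h1, use_lora=True)  # latent iteration with LoRA shift
            extra += 1
        refined.append(h1)
    return np.stack(refined), extra

tokens = rng.normal(size=(16, HIDDEN))
out, n_extra = think_at_hard(tokens)
print(out.shape, n_extra)
```

In the paper's setting the decider is trained so that only a small fraction of tokens (about 6%) take the second iteration; here the threshold is arbitrary, so the fraction flagged is meaningless beyond illustrating the control flow.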