Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
November 11, 2025
Authors: Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang
cs.AI
Abstract
Improving the reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalizing an output, the last-layer hidden states are fed back as inputs for additional iterations that refine the token prediction. Yet we identify a latent overthinking phenomenon: easy-token predictions that are already correct after the first pass are sometimes revised into errors by the additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM's objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token-sequence dimension to an additional iteration-depth dimension, enabling cross-iteration information flow while preserving full sequence parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while keeping the parameter count unchanged. Compared with baselines that iterate twice on all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned on the same data, it also delivers 4.0-5.0% accuracy gains. When less than 3% additional parameters are allowed for the LoRA modules and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at https://github.com/thu-nics/TaH.
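The control flow the abstract describes can be sketched in miniature: a standard forward pass, a lightweight decider that flags likely-wrong tokens, an extra latent iteration (with a low-rank weight delta) only for flagged tokens, and a mask that is causal along both the sequence and iteration-depth axes. Everything below is a toy illustration under our own assumptions: the random weights, the decider threshold `TAU`, the rank-2 LoRA delta, and the specific duo-causal masking rule are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16        # toy hidden size
VOCAB = 32    # toy vocabulary size
TAU = 0.5     # decider threshold (hypothetical)

# Toy "frozen" model weights, plus a low-rank (LoRA-style) update that is
# applied only during the extra latent iterations.
W = rng.standard_normal((D, D)) / np.sqrt(D)    # stand-in for the transformer stack
A = rng.standard_normal((D, 2)) / np.sqrt(D)    # LoRA down-projection (rank 2)
B = rng.standard_normal((2, D)) / np.sqrt(D)    # LoRA up-projection
W_out = rng.standard_normal((D, VOCAB)) / np.sqrt(D)
w_dec = rng.standard_normal(D) / np.sqrt(D)     # lightweight neural decider

def forward(h, use_lora=False):
    """One toy 'forward pass': frozen weights, LoRA delta in latent iterations."""
    W_eff = W + A @ B if use_lora else W
    return np.tanh(h @ W_eff)

def decider(h):
    """Sigmoid score: how likely the current prediction is to be wrong."""
    return 1.0 / (1.0 + np.exp(-h @ w_dec))

def predict(h):
    return int(np.argmax(h @ W_out))

def decode_token(x, max_extra_iters=1):
    """Standard pass first; iterate deeper only if the decider flags the token."""
    h = forward(x)                       # first, standard forward pass
    n_extra = 0
    for _ in range(max_extra_iters):
        if decider(h) < TAU:             # easy token: stop, avoid latent overthinking
            break
        h = forward(h, use_lora=True)    # feed hidden state back, LoRA-adapted
        n_extra += 1
    return predict(h), n_extra

def duo_causal_mask(T, K):
    """One plausible reading of 'duo-causal': a query at (token t, iter k) may
    attend to (t2, k2) iff t2 <= t and k2 <= k, i.e. causal along both the
    token-sequence and iteration-depth dimensions. Keys/queries are laid out
    iteration-major: [(t, k) for k in range(K) for t in range(T)]."""
    idx = [(t, k) for k in range(K) for t in range(T)]
    n = len(idx)
    mask = np.zeros((n, n), dtype=bool)
    for qi, (t, k) in enumerate(idx):
        for kj, (t2, k2) in enumerate(idx):
            mask[qi, kj] = (t2 <= t) and (k2 <= k)
    return mask

results = [decode_token(rng.standard_normal(D)) for _ in range(8)]
extra = sum(n for _, n in results)
print(f"extra latent iterations triggered on {extra}/8 tokens")
```

In the real method the decider would be trained against first-pass correctness labels and the mask would gate attention inside the transformer; here both are only illustrating the selective-iteration idea and the two causal axes.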