Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

Abstract

Large language models (LLMs) that can continually improve beyond their training budgets are able to solve increasingly difficult problems by adapting at test time, a property we refer to as extrapolation. However, standard reinforcement learning (RL) operates over fixed problem distributions and training budgets, which limits extrapolation under distribution shift at test time. To address this, we introduce RC, an iterative decoding algorithm that replaces standard autoregressive decoding during both training and inference. RC exploits an asymmetry between the response generation and summarization capabilities of LLMs to construct reasoning chains that consistently improve across iterations. Models trained to use RC can extrapolate and continually improve over reasoning horizons more than an order of magnitude longer than those seen during training. Empirically, training a 4B-parameter model with RC using a 16k-token training budget improves performance on HMMT 2025 from 40% to nearly 70% when 0.5M tokens are used at test time, outperforming both comparably sized models and many larger reasoning LLMs. Finally, we show that models trained with RC can more effectively leverage existing scaffolds to further scale test-time performance, owing to the improved summary-conditioned generation abilities learned through training.
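As a rough illustration of the kind of loop the abstract describes, the sketch below alternates between generating a response conditioned on a running summary and compressing that response back into the summary. This is only a minimal sketch under stated assumptions, not the RC algorithm itself: the function name `iterative_decode`, the `generate` and `summarize` callables, their signatures, and the toy stand-ins are all hypothetical, and the abstract does not specify how RC prompts, truncates, or scores each iteration.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures (assumptions, not taken from the paper):
#   generate(problem, summary) -> response
#   summarize(problem, previous_summary, response) -> new_summary
GenerateFn = Callable[[str, str], str]
SummarizeFn = Callable[[str, str, str], str]


def iterative_decode(
    problem: str,
    generate: GenerateFn,
    summarize: SummarizeFn,
    num_iterations: int,
) -> Tuple[str, List[str]]:
    """Alternate summary-conditioned generation and summarization.

    Each iteration sees only the problem and a short summary of earlier
    work, so the context per call stays bounded while the total number
    of generated tokens grows with the number of iterations.
    """
    summary = ""  # nothing has been attempted yet
    responses: List[str] = []
    for _ in range(num_iterations):
        response = generate(problem, summary)
        responses.append(response)
        summary = summarize(problem, summary, response)
    return summary, responses


# Toy stand-ins for an LLM, used only to make the sketch executable.
def toy_generate(problem: str, summary: str) -> str:
    return f"{problem} | notes: {summary} | new step"


def toy_summarize(problem: str, summary: str, response: str) -> str:
    # Record that one more step was taken; a real summarizer would
    # compress the content of the response instead.
    return (summary + " step").strip()


final_summary, all_responses = iterative_decode(
    "2+2", toy_generate, toy_summarize, num_iterations=3
)
```

In this toy run the number of iterations plays the role of the test-time budget: running more iterations spends more total tokens without lengthening the context of any single call, which is one plausible reading of how a short training horizon could be extended at inference.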

