Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
July 22, 2025
作者: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
cs.AI
Abstract
To break the context limits of large language models (LLMs) that bottleneck
reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM),
a family of LLMs trained for recursive and decompositional problem solving, and
TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond
context limits. Hosted on TIMRUN, TIM supports virtually unlimited
working memory and multi-hop tool calls within a single language model
inference, overcoming output limits, positional-embedding constraints, and
GPU-memory bottlenecks. This is achieved by modeling natural language as
reasoning trees, measured by both length and depth, rather than as linear sequences.
The reasoning trees consist of tasks with thoughts, recursive subtasks, and
conclusions, based on the concept we proposed in Schroeder et al. (2025). During
generation, we maintain a working memory that retains only the key-value states
of the most relevant context tokens, selected by a rule-based subtask-pruning
mechanism, enabling reuse of positional embeddings and GPU memory pages
throughout reasoning. Experimental results show that our system sustains high
inference throughput, even when manipulating up to 90% of the KV cache in GPU
memory. It also delivers accurate reasoning on mathematical tasks and handles
information retrieval challenges that require long-horizon reasoning and
multi-hop tool use.
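The subtask-pruning idea in the abstract can be illustrated with a minimal sketch. The `Task` structure and `working_memory` helper below are hypothetical simplifications for illustration, not the paper's actual data structures or pruning rule: once a subtask produces a conclusion, its internal thoughts and sub-subtasks are evicted from working memory, and only the conclusion is retained.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """A node in a reasoning tree: a thought, optional recursive
    subtasks, and a conclusion once the subtask is finished."""
    thought: str
    subtasks: List["Task"] = field(default_factory=list)
    conclusion: str = ""

def working_memory(task: Task) -> List[str]:
    """Collect the context retained after pruning: a finished subtask
    (one with a conclusion) contributes only its conclusion, while an
    active subtask keeps its full interior context."""
    mem = [task.thought]
    for sub in task.subtasks:
        if sub.conclusion:  # finished: prune interior, keep conclusion
            mem.append(sub.conclusion)
        else:               # still active: recurse into its context
            mem.extend(working_memory(sub))
    return mem

# Hypothetical example: one finished subtask (pruned) and one active one.
root = Task("solve the equation")
finished = Task("step 1: simplify", conclusion="x = 3")
finished.subtasks = [Task("intermediate algebra")]  # dropped by pruning
root.subtasks = [finished, Task("step 2: verify")]
print(working_memory(root))  # ['solve the equation', 'x = 3', 'step 2: verify']
```

In the actual system the analogous operation would act on key-value cache entries rather than strings, which is what allows positional embeddings and GPU memory pages to be reused as the tree grows.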