Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
July 22, 2025
Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
cs.AI
Abstract
To break the context limits of large language models (LLMs) that bottleneck
reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM),
a family of LLMs trained for recursive and decompositional problem solving, and
TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond
context limits. Hosted on TIMRUN, TIM supports virtually unlimited
working memory and multi-hop tool calls within a single language-model
inference, overcoming output limits, positional-embedding constraints, and
GPU-memory bottlenecks. This performance comes from modeling natural language
as reasoning trees, measured by both length and depth, rather than as linear
sequences. Each reasoning tree consists of tasks with thoughts, recursive
subtasks, and conclusions, building on the concept we proposed in
Schroeder et al., 2025 (see the sketches below). During
generation, we maintain a working memory that retains only the key-value states
of the most relevant context tokens, selected by a rule-based subtask-pruning
mechanism, enabling reuse of positional embeddings and GPU memory pages
throughout reasoning. Experimental results show that our system sustains high
inference throughput, even when manipulating up to 90% of the KV cache in GPU
memory. It also delivers accurate reasoning on mathematical tasks and handles
information retrieval challenges that require long-horizon reasoning and
multi-hop tool use.
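
As a rough illustration of the reasoning-tree structure the abstract describes, here is a minimal Python sketch. The names (`Task`, `thought`, `subtasks`, `conclusion`, `prune`) mirror the abstract's terminology but are hypothetical, not the authors' actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    """One node of a TIM-style reasoning tree (hypothetical structure)."""
    thought: str                          # intermediate reasoning for this task
    subtasks: List["Task"] = field(default_factory=list)  # recursive decomposition
    conclusion: Optional[str] = None      # set once the task is resolved

def prune(task: Task) -> None:
    """Rule-based subtask pruning: once a task has a conclusion, its
    subtasks' intermediate tokens no longer need to stay in working
    memory; only the conclusion is kept for later reasoning steps."""
    for sub in task.subtasks:
        prune(sub)
    if task.conclusion is not None:
        task.subtasks = []  # drop the subtree; the conclusion summarizes it
```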
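In the same spirit, a toy sketch of the working-memory bookkeeping: when a subtask is pruned, the GPU memory pages holding its key-value states (and the positional slots they occupied) return to a free pool for reuse by later tokens. Everything here is an assumption for illustration; the abstract does not specify TIMRUN's actual paging mechanism.

```python
class WorkingMemory:
    """Toy page allocator mimicking KV-cache page reuse (illustrative only)."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # reusable GPU memory pages
        self.live = {}                            # subtask id -> its KV pages

    def allocate(self, subtask_id: str, n_pages: int) -> list:
        """Reserve pages for a subtask's key-value states."""
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        self.live[subtask_id] = pages
        return pages

    def release(self, subtask_id: str) -> None:
        """Called when a subtask is pruned: its pages (and the positional
        indices they occupied) become reusable by later tokens."""
        self.free_pages.extend(self.live.pop(subtask_id, []))
```

Keeping only concluded-task summaries in working memory is what allows positional embeddings and memory pages to be recycled, so the live context stays bounded even as the reasoning trace grows.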