Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
July 22, 2025
Authors: Hongyin Luo, Nathaniel Morgan, Tina Li, Derek Zhao, Ai Vy Ngo, Philip Schroeder, Lijie Yang, Assaf Ben-Kish, Jack O'Brien, James Glass
cs.AI
Abstract
To break the context limits of large language models (LLMs) that bottleneck
reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM),
a family of LLMs trained for recursive and decompositional problem solving, and
TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond
context limits. Hosted on TIMRUN, TIM supports virtually unlimited
working memory and multi-hop tool calls within a single language-model
inference, overcoming output limits, positional-embedding constraints, and
GPU-memory bottlenecks. This performance comes from modeling natural language
as reasoning trees, measured by both length and depth, rather than as linear
sequences. Each reasoning tree consists of tasks with thoughts, recursive
subtasks, and conclusions, building on the concept we proposed in
Schroeder et al., 2025 (see the sketches below). During
generation, we maintain a working memory that retains only the key-value states
of the most relevant context tokens, selected by a rule-based subtask-pruning
mechanism, enabling reuse of positional embeddings and GPU memory pages
throughout reasoning. Experimental results show that our system sustains high
inference throughput, even when manipulating up to 90% of the KV cache in GPU
memory. It also delivers accurate reasoning on mathematical tasks and handles
information retrieval challenges that require long-horizon reasoning and
multi-hop tool use.
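
As a rough illustration of the reasoning-tree structure the abstract describes, here is a minimal Python sketch. The names (`Task`, `thought`, `subtasks`, `conclusion`, `prune`) mirror the abstract's terminology but are hypothetical, not the authors' actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    """One node of a TIM-style reasoning tree (hypothetical structure)."""
    thought: str                          # intermediate reasoning for this task
    subtasks: List["Task"] = field(default_factory=list)  # recursive decomposition
    conclusion: Optional[str] = None      # set once the task is resolved

def prune(task: Task) -> None:
    """Rule-based subtask pruning: once a task has a conclusion, its
    subtasks' intermediate tokens no longer need to stay in working
    memory; only the conclusion is kept for later reasoning steps."""
    for sub in task.subtasks:
        prune(sub)
    if task.conclusion is not None:
        task.subtasks = []  # drop the subtree; the conclusion summarizes it
```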
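In the same spirit, a toy sketch of the working-memory bookkeeping: when a subtask is pruned, the GPU memory pages holding its key-value states (and the positional slots they occupied) return to a free pool for reuse by later tokens. Everything here is an assumption for illustration; the abstract does not specify TIMRUN's actual paging mechanism.

```python
class WorkingMemory:
    """Toy page allocator mimicking KV-cache page reuse (illustrative only)."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))  # reusable GPU memory pages
        self.live = {}                            # subtask id -> its KV pages

    def allocate(self, subtask_id: str, n_pages: int) -> list:
        """Reserve pages for a subtask's key-value states."""
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        self.live[subtask_id] = pages
        return pages

    def release(self, subtask_id: str) -> None:
        """Called when a subtask is pruned: its pages (and the positional
        indices they occupied) become reusable by later tokens."""
        self.free_pages.extend(self.live.pop(subtask_id, []))
```

Keeping only concluded-task summaries in working memory is what allows positional embeddings and memory pages to be recycled, so the live context stays bounded even as the reasoning trace grows.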