ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

November 24, 2025
Authors: Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
cs.AI

Abstract

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering a 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
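
The abstract describes the trie-based co-design only at a high level. As a rough illustration of the underlying idea, the minimal Python sketch below (hypothetical names: TrieNode, spawn, context; this is not the paper's implementation) shows how parallel reasoning threads can branch from a shared token prefix stored in a trie. Each thread's context is just the root-to-node concatenation, i.e. an ordinary left-to-right prompt, which is why a stock autoregressive engine with standard prefix caching could serve it without modified position embeddings or KV caches.

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    tokens: list[int]                       # token IDs appended at this node
    parent: "TrieNode | None" = None
    children: list["TrieNode"] = field(default_factory=list)

    def spawn(self, tokens: list[int]) -> "TrieNode":
        """Fork a new reasoning thread that extends this node's context."""
        child = TrieNode(tokens=tokens, parent=self)
        self.children.append(child)
        return child

    def context(self) -> list[int]:
        """Root-to-here token sequence: the plain left-to-right prompt an
        unmodified autoregressive engine would see for this thread."""
        node, chunks = self, []
        while node is not None:
            chunks.append(node.tokens)
            node = node.parent
        return [t for chunk in reversed(chunks) for t in chunk]

# A shared problem prefix forks into two concurrent reasoning threads.
root = TrieNode(tokens=[101, 102, 103])   # e.g. the tokenized problem statement
thread_a = root.spawn([201, 202])         # first parallel thread
thread_b = root.spawn([301])              # second parallel thread
assert thread_a.context() == [101, 102, 103, 201, 202]
assert thread_b.context() == [101, 102, 103, 301]
# Both prompts share the [101, 102, 103] prefix, so an engine's ordinary
# prefix cache can reuse that KV state across threads with no engine changes.
```

Because the two thread contexts differ only after the shared prefix, they can be decoded as independent concurrent requests; the reduction in token latency comes from overlapping these decodes rather than from any custom attention or cache machinery.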