ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models
November 24, 2025
Authors: Long Lian, Sida Wang, Felix Juefei-Xu, Tsu-Jui Fu, Xiuyu Li, Adam Yala, Trevor Darrell, Alane Suhr, Yuandong Tian, Xi Victoria Lin
cs.AI
Abstract
Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. On realistic tasks, however, existing methods are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver, trained atop Qwen3-8B, achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to a 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
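To make the trie-based co-design concrete, the sketch below illustrates one plausible reading of the idea under stated assumptions: it is not the authors' code, and the `TrieNode` and `linearize_paths` names are hypothetical. Parallel reasoning threads that share a common prefix are stored as branches of a trie; each thread is then materialized as an ordinary prefix-plus-continuation token sequence, which any stock autoregressive engine can serve without changes to position embeddings or KV caches.

```python
# Minimal sketch (illustrative, not the paper's implementation) of a trie over
# reasoning threads: parallel threads share a common prefix stored once, and
# each thread decodes as a plain "prefix + branch" request on a standard engine.
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    tokens: list[int]                      # token span owned by this node
    children: list["TrieNode"] = field(default_factory=list)


def linearize_paths(root: TrieNode) -> list[list[int]]:
    """Expand the trie into one flat token sequence per leaf thread.

    Each returned sequence runs root-to-leaf (shared prefix, then one branch),
    which is exactly the input an off-the-shelf autoregressive engine expects,
    so positions and KV caches need no modification.
    """
    paths: list[list[int]] = []

    def walk(node: TrieNode, prefix: list[int]) -> None:
        current = prefix + node.tokens
        if not node.children:              # leaf = one reasoning thread
            paths.append(current)
            return
        for child in node.children:        # branch point = parallel fork
            walk(child, current)

    walk(root, [])
    return paths


# Example: a shared prefix [1, 2, 3] forking into two parallel threads.
root = TrieNode(tokens=[1, 2, 3],
                children=[TrieNode(tokens=[4, 5]), TrieNode(tokens=[6])])
threads = linearize_paths(root)
assert threads == [[1, 2, 3, 4, 5], [1, 2, 3, 6]]
# On an engine with automatic prefix caching (e.g. vLLM), the shared prefix
# [1, 2, 3] is computed once and reused across the concurrent requests.
```

One design consequence worth noting: because each thread is just a regular sequence, the latency win comes from decoding branches concurrently while prefix caching amortizes the shared portion, which is consistent with the abstract's claim of deployment on any off-the-shelf engine.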