ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models

Abstract

Scaling inference-time computation has enabled Large Language Models (LLMs) to achieve strong reasoning performance, but inherently sequential decoding leads to substantial latency, especially on complex tasks. Recent work on adaptive parallel reasoning aims to improve inference efficiency by decomposing the problem-solving process into concurrent reasoning threads when beneficial. However, existing methods on realistic tasks are either limited to supervised behavior cloning or exhibit significant accuracy drops compared to widely-used sequential long chain-of-thought (CoT) baselines. Moreover, many require customized inference engines, complicating deployment. We introduce ThreadWeaver, a framework for adaptive parallel reasoning that achieves accuracy on par with popular sequential reasoning models of comparable size while significantly reducing inference latency. ThreadWeaver's performance stems from three key innovations: 1) a two-stage parallel trajectory generator that produces large-scale, high-quality CoT data with parallel annotations for supervised fine-tuning; 2) a trie-based training-inference co-design that enables parallel reasoning on any off-the-shelf autoregressive inference engine without modifying position embeddings or KV caches; and 3) a parallelization-aware reinforcement learning framework that teaches the model to balance accuracy with effective parallelization. Across six challenging mathematical reasoning benchmarks, ThreadWeaver trained atop Qwen3-8B achieves accuracy comparable to cutting-edge sequential reasoning models (71.9% on average and 79.9% on AIME24) while delivering up to 1.53x average speedup in token latency, establishing a new Pareto frontier between accuracy and efficiency.
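
As a rough illustration of why parallel threads can reduce token latency, the sketch below models a reasoning trace as sequential segments interleaved with fork-join blocks and compares the total token count with the longest chain of tokens that must be decoded one after another. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `ParallelBlock`, `Trace`, `total_tokens`, `critical_path_tokens`, and `token_latency_speedup` are hypothetical, the token counts are made up for the example, and the paper's exact definition of token latency may differ. The ratio is also taken against decoding the same trace serially, whereas the reported speedups are measured against a sequential reasoning baseline.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ParallelBlock:
    """Hypothetical fork-join block: threads decode concurrently, then rejoin."""
    thread_tokens: List[int]  # token count of each concurrent thread


@dataclass
class Trace:
    """Hypothetical trace: ints are sequential segments, ParallelBlocks are forks."""
    parts: List[Union[int, ParallelBlock]] = field(default_factory=list)


def total_tokens(trace: Trace) -> int:
    """Tokens generated overall, i.e. the cost of decoding everything serially."""
    total = 0
    for part in trace.parts:
        if isinstance(part, ParallelBlock):
            total += sum(part.thread_tokens)
        else:
            total += part
    return total


def critical_path_tokens(trace: Trace) -> int:
    """Longest chain of tokens that must be decoded one after another."""
    path = 0
    for part in trace.parts:
        if isinstance(part, ParallelBlock):
            # Concurrent threads overlap, so only the longest one adds latency.
            path += max(part.thread_tokens, default=0)
        else:
            path += part
    return path


def token_latency_speedup(trace: Trace) -> float:
    """Ratio of serial token count to critical-path token count."""
    return total_tokens(trace) / critical_path_tokens(trace)


# Illustrative numbers only: a 200-token prefix, three threads, a 100-token suffix.
trace = Trace(parts=[200, ParallelBlock([300, 500, 400]), 100])
print(total_tokens(trace))           # 1500
print(critical_path_tokens(trace))   # 800
print(token_latency_speedup(trace))  # 1.875
```

In this toy model a trace with no parallel blocks has a ratio of exactly 1.0, and the gain grows only when threads are both numerous and balanced in length, which is consistent with the abstract's emphasis on effective parallelization as well as accuracy.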
