Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence

Abstract

Recent advances in reasoning models have demonstrated significant improvements in accuracy, particularly on complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask and process them within a single sequence, which avoids additional memory usage. Experimental results show that our method achieves a speedup of over 100% in decoding time while maintaining answer quality.
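The specialized attention mask mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact mask layout, so the following is an assumption rather than the authors' implementation: the shared prefix comes first, branch tokens are interleaved step by step within the same sequence, and each token may attend only to the prefix and to earlier tokens of its own branch. The function name `build_branch_mask` and its parameters are hypothetical.

```python
from typing import List


def build_branch_mask(prefix_len: int, branch_lens: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key position k.

    Assumed layout (hypothetical, not taken from the paper): the shared prefix
    occupies the first positions, followed by branch tokens interleaved step by
    step (step 0 of every branch, then step 1 of every branch, and so on).
    """
    # Record (branch_id, step) for every position; prefix positions use branch_id -1.
    owners = [(-1, i) for i in range(prefix_len)]
    for step in range(max(branch_lens, default=0)):
        for branch_id, length in enumerate(branch_lens):
            if step < length:
                owners.append((branch_id, step))

    size = len(owners)
    mask = [[False] * size for _ in range(size)]
    for q, (q_branch, _) in enumerate(owners):
        for k in range(q + 1):  # causal: only the current and earlier positions
            k_branch = owners[k][0]
            # Visible if the key is in the shared prefix or in the query's own branch.
            mask[q][k] = (k_branch == -1) or (k_branch == q_branch)
    return mask


if __name__ == "__main__":
    # Two prefix tokens followed by two branches of two tokens each.
    for row in build_branch_mask(2, [2, 2]):
        print("".join("x" if visible else "." for visible in row))
```

Under this assumed layout, one decoding step appends one new token per active branch, and the mask keeps the branches from seeing each other even though they share a single sequence and a single copy of the prefix.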
