Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

Abstract

Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, no single one of these frameworks is suitable for all types of tasks, which can limit their applicability. In this work, we propose a different design approach: we run LLM "workers" in parallel, allow them to synchronize via a concurrently updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the instances to come up with their own collaboration strategy for the problem at hand, all the while "seeing" each other's partial progress in the concurrent cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with "instant" access to each other's generated tokens. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.
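
The abstract states that RoPE lets the engine avoid recomputation, but does not spell out why. The following is a minimal sketch of the general RoPE property that plausibly makes this possible, not the authors' implementation: attention scores between rotated queries and keys depend only on relative position, and an already-rotated cached key can be moved to a new position by one extra rotation. The helper names (`rope_rotate`, `dot`), the toy vectors, the adjacent-pair layout, and the base of 10000 are all assumptions chosen for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector.

    Adjacent pairs (vec[i], vec[i+1]) are rotated by an angle that grows
    linearly with the position (hypothetical helper, illustration only).
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.1, 0.4, -0.6, 0.9]

# The query sits 3 positions after the key in both cases,
# but at very different absolute positions.
score_a = dot(rope_rotate(query, 5), rope_rotate(key, 2))
score_b = dot(rope_rotate(query, 105), rope_rotate(key, 102))

# A key cached at position 2 can be shifted by 100 positions with one
# extra rotation, instead of being recomputed from the raw key.
shifted = rope_rotate(rope_rotate(key, 2), 100)
direct = rope_rotate(key, 102)

print(math.isclose(score_a, score_b, abs_tol=1e-9))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(shifted, direct)))
```

Because rotations compose additively, a block of cached keys written by one worker could in principle be placed at a different offset in another worker's view of the cache without rerunning the model on those tokens. How Hogwild! Inference actually arranges the cache is not described in this abstract.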
