Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Abstract

Multi-LLM systems harness the complementary strengths of diverse Large Language Models (LLMs), achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, which forces each model to collapse its internal representations into a sequence of output tokens. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: can LLMs communicate beyond text? Oracle experiments show that enriching the semantics of the KV-cache can improve response quality without increasing cache size, supporting the KV-cache as an effective medium for inter-model communication. We therefore propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project the source model's KV-cache and fuse it with that of the target model, enabling direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C draws on the deep, specialized semantics of both models while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than the individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
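
To make the project-fuse-gate idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single linear projection per layer, a residual (additive) fusion, and one scalar sigmoid gate per target layer; the abstract only says that a neural network projects and fuses the caches and that a learnable gate selects layers, so all of these choices, along with the names fuse_layer, projections, and gate_logits, are hypothetical. Weights are random rather than trained, and each cache is reduced to one small matrix per layer.

```python
import math
import random


def matmul(a, b):
    """Plain matrix product: a is (n x k), b is (k x m), result is (n x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def fuse_layer(target_kv, source_kv, proj, gate_logit):
    """Fuse one layer's cache (assumed form: target + gate * projected source).

    target_kv: (tokens x d_tgt), source_kv: (tokens x d_src), proj: (d_src x d_tgt).
    The result has the same shape as target_kv, so the cache does not grow.
    """
    projected = matmul(source_kv, proj)  # map source features into the target dimension
    gate = sigmoid(gate_logit)           # near 0: layer ignores the source; near 1: uses it
    return [
        [t + gate * p for t, p in zip(t_row, p_row)]
        for t_row, p_row in zip(target_kv, projected)
    ]


rng = random.Random(0)
tokens, d_src, d_tgt, n_layers = 4, 6, 3, 2

# Toy per-layer caches for a hypothetical source and target model.
source_cache = [random_matrix(tokens, d_src, rng) for _ in range(n_layers)]
target_cache = [random_matrix(tokens, d_tgt, rng) for _ in range(n_layers)]

# In a trained system these would be learned; here they are fixed by hand.
projections = [random_matrix(d_src, d_tgt, rng) for _ in range(n_layers)]
gate_logits = [-20.0, 20.0]  # layer 0 effectively closed, layer 1 effectively open

fused_cache = [
    fuse_layer(t, s, w, g)
    for t, s, w, g in zip(target_cache, source_cache, projections, gate_logits)
]
```

With these hand-set gates, layer 0 of fused_cache stays essentially identical to the target's own cache, while layer 1 carries the projected source information on top of it. In both cases the fused cache keeps the target's shape, which mirrors the abstract's point that semantics can be enriched without increasing cache size.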
