Cache-to-Cache: Direct Semantic Communication Between Large Language Models
October 3, 2025
Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
cs.AI
Abstract
Multi-LLM systems harness the complementary strengths of diverse Large
Language Models, achieving performance and efficiency gains unattainable by a
single model. In existing designs, LLMs communicate through text, forcing
internal representations to be transformed into output token sequences. This
process both loses rich semantic information and incurs token-by-token
generation latency. Motivated by these limitations, we ask: Can LLMs
communicate beyond text? Oracle experiments show that enriching the KV-Cache
semantics can improve response quality without increasing cache size,
supporting KV-Cache as an effective medium for inter-model communication. Thus,
we propose Cache-to-Cache (C2C), a new paradigm for direct semantic
communication between LLMs. C2C uses a neural network to project and fuse the
source model's KV-Cache with that of the target model to enable direct semantic
transfer. A learnable gating mechanism selects the target layers that benefit
from cache communication. Compared with text communication, C2C utilizes the
deep, specialized semantics from both models, while avoiding explicit
intermediate text generation. Experiments show that C2C achieves 8.5-10.5%
higher average accuracy than individual models. It further outperforms the text
communication paradigm by approximately 3.0-5.0%, while delivering an average
2.0x speedup in latency. Our code is available at
https://github.com/thu-nics/C2C.
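
To make the fusion mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a per-layer projector maps the source model's KV-Cache into the target model's representation space, and a learnable per-layer gate controls how much each target layer takes from the communicated cache. This is an illustrative reading, not the released implementation; the class name C2CFuser, the additive fusion, the sigmoid gate, and the assumption that both models share head count and sequence length (differing only in head dimension) are our own choices. See https://github.com/thu-nics/C2C for the actual code.

# Minimal sketch of C2C-style cache fusion (hypothetical; not the authors' code).
# Assumes both models expose per-layer KV-Caches shaped
# (batch, num_heads, seq_len, head_dim), matching in all but head_dim.
import torch
import torch.nn as nn

class C2CFuser(nn.Module):
    def __init__(self, src_dim: int, tgt_dim: int, num_layers: int):
        super().__init__()
        # Per-layer projectors: map source K/V head dimensions into the
        # target model's K/V head dimensions.
        self.k_proj = nn.ModuleList(nn.Linear(src_dim, tgt_dim) for _ in range(num_layers))
        self.v_proj = nn.ModuleList(nn.Linear(src_dim, tgt_dim) for _ in range(num_layers))
        # One learnable gate per target layer; sigmoid keeps it in (0, 1),
        # so layers that do not benefit can learn to ignore the source cache.
        self.gate = nn.Parameter(torch.zeros(num_layers))

    def forward(self, src_kv, tgt_kv):
        # src_kv / tgt_kv: lists of (K, V) tuples, one per layer, each tensor
        # shaped (batch, num_heads, seq_len, head_dim).
        fused = []
        for i, ((src_k, src_v), (tgt_k, tgt_v)) in enumerate(zip(src_kv, tgt_kv)):
            g = torch.sigmoid(self.gate[i])
            # Project the source cache into the target space, then blend it
            # additively into the target's own cache.
            fused_k = tgt_k + g * self.k_proj[i](src_k)
            fused_v = tgt_v + g * self.v_proj[i](src_v)
            fused.append((fused_k, fused_v))
        return fused

In this sketch, the target model would decode conditioned on the fused cache, skipping the source model's token-by-token text generation entirely, which is where the latency advantage over text-based communication comes from.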