Cache-to-Cache: Direct Semantic Communication Between Large Language Models
October 3, 2025
Authors: Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
cs.AI
Abstract
Multi-LLM systems harness the complementary strengths of diverse Large
Language Models, achieving performance and efficiency gains unattainable by a
single model. In existing designs, LLMs communicate through text, forcing
internal representations to be transformed into output token sequences. This
process both loses rich semantic information and incurs token-by-token
generation latency. Motivated by these limitations, we ask: Can LLMs
communicate beyond text? Oracle experiments show that enriching the KV-Cache
semantics can improve response quality without increasing cache size,
supporting KV-Cache as an effective medium for inter-model communication. Thus,
we propose Cache-to-Cache (C2C), a new paradigm for direct semantic
communication between LLMs. C2C uses a neural network to project and fuse the
source model's KV-Cache with that of the target model to enable direct semantic
transfer. A learnable gating mechanism selects the target layers that benefit
from cache communication. Compared with text communication, C2C utilizes the
deep, specialized semantics from both models, while avoiding explicit
intermediate text generation. Experiments show that C2C achieves 8.5-10.5%
higher average accuracy than individual models. It further outperforms the text
communication paradigm by approximately 3.0-5.0%, while delivering an average
2.0x speedup in latency. Our code is available at
https://github.com/thu-nics/C2C.
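
The abstract describes C2C's mechanism only at a high level: a neural network projects the source model's KV-Cache into the target model's KV space and fuses the two, and a learnable gate decides which target layers use the fused cache. The PyTorch sketch below is a minimal, hypothetical rendering of that idea for one layer; the class name, the additive fusion form, and the simplified [batch, seq, dim] cache layout are assumptions rather than the paper's actual design (see the released code at https://github.com/thu-nics/C2C for that).

```python
import torch
import torch.nn as nn

class C2CFuserSketch(nn.Module):
    """Hypothetical single-layer Cache-to-Cache fuser.

    Projects a source model's cached keys/values into the target
    model's KV space and blends them in through a learnable,
    sigmoid-bounded gate. Real KV-Caches are shaped
    [batch, heads, seq, head_dim]; this sketch flattens that to
    [batch, seq, dim] for brevity.
    """

    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        # Projections from the source to the target hidden dimension.
        self.k_proj = nn.Linear(src_dim, tgt_dim)
        self.v_proj = nn.Linear(src_dim, tgt_dim)
        # One learnable gate per layer: after the sigmoid it lies in
        # (0, 1), letting training decide whether this layer benefits
        # from cache communication (near 0 = ignore the source cache).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, src_k, src_v, tgt_k, tgt_v):
        g = torch.sigmoid(self.gate)
        # Additive fusion here stands in for whatever fusion network
        # the paper actually uses; the key point is that fusion acts
        # directly on KV-Cache tensors, not on generated text.
        fused_k = tgt_k + g * self.k_proj(src_k)
        fused_v = tgt_v + g * self.v_proj(src_v)
        return fused_k, fused_v

# Toy usage: fuse a 512-dim source cache into a 768-dim target cache.
fuser = C2CFuserSketch(src_dim=512, tgt_dim=768)
src_k = torch.randn(1, 16, 512); src_v = torch.randn(1, 16, 512)
tgt_k = torch.randn(1, 16, 768); tgt_v = torch.randn(1, 16, 768)
fused_k, fused_v = fuser(src_k, src_v, tgt_k, tgt_v)
print(fused_k.shape)  # torch.Size([1, 16, 768])
```

Because fusion happens directly on cached keys and values, the target model attends over the enriched cache in a single prefill pass rather than waiting for the source model to decode text token by token, which is consistent with the latency advantage the abstract reports over text communication.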