See What I See, Know What I Think: Dense Latent Communication Between Heterogeneous Agents

Abstract

Multi-agent systems communicate mostly through text, which incurs a lossy and expensive decode and re-encode step. KV-cache communication is a promising alternative, yet most prior work is limited to homogeneous settings that use copies of the same model, and so avoids the central challenge of aligning latent representations across different models. Existing heterogeneous methods are also restrictive: they typically assume that sender and receiver share the input, and they use the transferred cache mainly for steering. We study a more fundamental question: can heterogeneous agents be aligned well enough to perform real "mind reading", transferring both what one agent sees and how it thinks? Our information-structure analysis reveals a duality. Context-aware transfer is driven by sparse reasoning signals, whereas context-unaware transfer, in which the receiver sees no input at all, requires dense preservation of contextual knowledge. Motivated by this, we propose a dense alignment method for heterogeneous KV-cache communication that combines a lightweight cross-model cache transformation with two-phase training: reconstruction followed by generation. Across all six directions of {Qwen3-4B, 8B, 14B} and six in-domain and out-of-domain benchmarks, our method outperforms prior heterogeneous baselines, matches or exceeds text communication in context-aware settings at roughly 2 to 3 times lower compute, and remains effective in context-unaware transfer, where prior methods collapse.
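To make the idea of a cross-model cache transformation concrete, the following is a minimal sketch rather than the method described above, whose architecture the abstract does not specify. It assumes, purely for illustration, that each receiver layer is paired with the sender layer at the same relative depth and that one linear projection per receiver layer maps sender-sized vectors to receiver-sized ones. The names `map_layer`, `translate_cache`, and `mse` are hypothetical, and the mean-squared-error loss is only one plausible reading of a reconstruction objective. The generation phase is not shown.

```python
import random


def matvec(W, x):
    """Multiply a (d_out x d_in) matrix by a length-d_in vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def map_layer(recv_layer, n_send, n_recv):
    """Assumed pairing: sender layer at the same relative depth as recv_layer."""
    return recv_layer * n_send // n_recv


def translate_cache(send_cache, proj, n_recv):
    """Map a sender cache into the receiver's shape.

    send_cache[layer][token] is a sender-sized vector (a key or a value).
    proj[recv_layer] is an assumed (d_recv x d_send) projection matrix.
    """
    n_send = len(send_cache)
    out = []
    for r in range(n_recv):
        src = send_cache[map_layer(r, n_send, n_recv)]
        out.append([matvec(proj[r], vec) for vec in src])
    return out


def mse(a, b):
    """Mean squared error between two caches of identical shape."""
    diffs = [
        (x - y) ** 2
        for layer_a, layer_b in zip(a, b)
        for tok_a, tok_b in zip(layer_a, layer_b)
        for x, y in zip(tok_a, tok_b)
    ]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    random.seed(0)
    n_send, n_recv, d_send, d_recv, n_tok = 4, 6, 3, 5, 2  # toy sizes only
    send_cache = [
        [[random.gauss(0, 1) for _ in range(d_send)] for _ in range(n_tok)]
        for _ in range(n_send)
    ]
    proj = [
        [[random.gauss(0, 1) for _ in range(d_send)] for _ in range(d_recv)]
        for _ in range(n_recv)
    ]
    translated = translate_cache(send_cache, proj, n_recv)
    # Shape of the translated cache: (receiver layers, tokens, receiver width)
    print(len(translated), len(translated[0]), len(translated[0][0]))
```

In a reconstruction phase of this kind, the projections would be fitted so that `mse(translated, receiver_cache)` is small, where `receiver_cache` is the cache the receiver would have produced from the same input. Real KV caches also carry separate key and value tensors per attention head, which this sketch flattens into a single vector per token.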