The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
February 17, 2026
Authors: Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao
cs.AI
Abstract
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and loses information to quantization. While latent-state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint feature manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N), and it leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to that of standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas.
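To make the hub-and-spoke design concrete, here is a minimal PyTorch sketch of how such a Universal Visual Codec could be wired: one lightweight encoder per sender maps that model's reasoning trace into a shared hub space, and one decoder per receiver projects hub latents into that model's visual-embedding space, where they can be injected in place of image patch embeddings. Every name, layer choice, and dimension here (SpokeAdapter, VisionWormholeCodec, hub_dim=1024, the Qwen-VL/Gemma widths) is an illustrative assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn


class SpokeAdapter(nn.Module):
    """Projects one model family's hidden states into the shared hub space.

    One adapter per model gives O(N) alignment cost, versus O(N^2) for
    one learned translator per sender-receiver pair.
    """

    def __init__(self, model_dim: int, hub_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(model_dim, hub_dim),
            nn.GELU(),
            nn.Linear(hub_dim, hub_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, model_dim) reasoning trace from the sender
        return self.proj(hidden_states)


class VisionWormholeCodec(nn.Module):
    """Hub-and-spoke codec (hypothetical sketch).

    Encodes a sender's reasoning trace into the shared continuous latent
    space, then decodes it into the receiver's visual-embedding space so
    the message enters through the receiver's visual pathway.
    """

    def __init__(self, model_dims: dict, vision_dims: dict, hub_dim: int = 1024):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: SpokeAdapter(d, hub_dim) for name, d in model_dims.items()}
        )
        self.decoders = nn.ModuleDict(
            {name: nn.Linear(hub_dim, d) for name, d in vision_dims.items()}
        )

    def forward(self, sender: str, receiver: str, trace: torch.Tensor) -> torch.Tensor:
        z = self.encoders[sender](trace)    # sender space -> shared hub space
        return self.decoders[receiver](z)   # hub space -> receiver's visual pathway


# Toy usage: a Qwen-VL-style sender passes a latent message to a Gemma-style
# receiver. All widths below are placeholders, not the real model dimensions.
codec = VisionWormholeCodec(
    model_dims={"qwen_vl": 3584, "gemma": 2304},
    vision_dims={"qwen_vl": 1280, "gemma": 1152},
)
trace = torch.randn(1, 64, 3584)                  # sender's reasoning trace
visual_tokens = codec("qwen_vl", "gemma", trace)  # injected as "visual" input
print(visual_tokens.shape)                        # torch.Size([1, 64, 1152])
```

Per the abstract, such a codec would be trained with a label-free teacher-student distillation objective, nudging the receiver's behavior on wormhole tokens toward its behavior on the equivalent text message; the loss formulation above is left out because the abstract does not specify it.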