The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
February 17, 2026
Authors: Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao
cs.AI
Abstract
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N²) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas.
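The abstract packs three mechanisms into one paragraph: a Universal Visual Codec that projects each sender's hidden states into a shared latent space, per-receiver adapters that inject those latents where image features normally enter the language model, and a label-free distillation loss that aligns the latent channel with the text channel. The sketch below is a minimal PyTorch illustration of that hub-and-spoke wiring; every name (SpokeEncoder, SpokeDecoder, SHARED_DIM, distillation_loss) and every dimension is an assumption made for exposition, not the authors' released implementation (see the linked repository for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

SHARED_DIM = 1024  # hypothetical width of the shared "hub" latent space

class SpokeEncoder(nn.Module):
    """Maps one sender family's hidden states into the shared latent space."""
    def __init__(self, sender_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sender_dim, SHARED_DIM),
            nn.GELU(),
            nn.Linear(SHARED_DIM, SHARED_DIM),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, sender_dim) reasoning trace
        return self.proj(hidden_states)

class SpokeDecoder(nn.Module):
    """Maps shared latents into one receiver's vision-embedding space,
    so they can be injected where image tokens normally enter the LLM."""
    def __init__(self, receiver_vision_dim: int):
        super().__init__()
        self.proj = nn.Linear(SHARED_DIM, receiver_vision_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.proj(z)

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      tau: float = 2.0) -> torch.Tensor:
    """Label-free objective: the receiver fed injected latents (student)
    should match the same receiver fed the sender's decoded text (teacher)."""
    t = F.softmax(teacher_logits / tau, dim=-1)
    s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

Design note: this wiring makes the claimed O(N²)-to-O(N) reduction concrete. With N model families, the hub trains one encoder and one decoder per family, and any pair communicates through the shared space; direct pairwise translators would instead require a module for each of the N² sender-receiver combinations.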