
UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration

September 26, 2025
作者: Qi Mao, Tinghan Yang, Jiahao Li, Bin Li, Libiao Jin, Yan Lu
cs.AI

Abstract

The rapid progress of Large Multimodal Models (LMMs) and cloud-based AI agents is transforming human-AI collaboration into bidirectional, multimodal interaction. However, existing codecs remain optimized for unimodal, one-way communication, resulting in repeated degradation under conventional compress-transmit-reconstruct pipelines. To address this limitation, we propose UniMIC, a Unified token-based Multimodal Interactive Coding framework that bridges edge devices and cloud AI agents. Instead of transmitting raw pixels or plain text, UniMIC employs compact tokenized representations as the communication medium, enabling efficient low-bitrate transmission while maintaining compatibility with LMMs. To further enhance compression, lightweight Transformer-based entropy models with scenario-specific designs (generic, masked, and text-conditioned) effectively minimize inter-token redundancy. Extensive experiments on text-to-image generation, text-guided inpainting, outpainting, and visual question answering show that UniMIC achieves substantial bitrate savings and remains robust even at ultra-low bitrates (<0.05 bpp), without compromising downstream task performance. These results establish UniMIC as a practical and forward-looking paradigm for next-generation multimodal interactive communication.
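The bitrate saving the abstract attributes to the entropy models follows from a standard information-theoretic argument: arithmetic coding driven by a context model spends close to -log2 p(token) bits per token, versus a flat log2(K) bits for a K-entry codebook. The toy sketch below illustrates only this principle; the token values, probabilities, and function names are hypothetical and are not taken from the paper's actual model.

```python
import math

def bits_uniform(num_tokens, codebook_size):
    # Without an entropy model, each token costs log2(K) bits.
    return num_tokens * math.log2(codebook_size)

def bits_with_model(tokens, probs):
    # With a (hypothetical) context model assigning probability p_i to
    # each token, arithmetic coding approaches the cross-entropy:
    # the sum of -log2(p_i) bits over the transmitted tokens.
    return sum(-math.log2(probs[t]) for t in tokens)

# Toy example: 4 tokens from a 1024-entry codebook, where the model
# has (hypothetically) learned that token 7 is very likely here.
tokens = [7, 7, 7, 42]
probs = {7: 0.9, 42: 0.01}
print(bits_uniform(len(tokens), 1024))   # 40.0
print(bits_with_model(tokens, probs))    # ≈ 7.10, far below 40
```

The better the model predicts the next token from already-decoded context (the role of the paper's generic, masked, and text-conditioned Transformer variants), the lower the cross-entropy and hence the bitrate.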
PDF · September 30, 2025