UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration
September 26, 2025
Authors: Qi Mao, Tinghan Yang, Jiahao Li, Bin Li, Libiao Jin, Yan Lu
cs.AI
Abstract
The rapid progress of Large Multimodal Models (LMMs) and cloud-based AI
agents is transforming human-AI collaboration into bidirectional, multimodal
interaction. However, existing codecs remain optimized for unimodal, one-way
communication, resulting in repeated degradation under conventional
compress-transmit-reconstruct pipelines. To address this limitation, we propose
UniMIC, a Unified token-based Multimodal Interactive Coding framework that
bridges edge devices and cloud AI agents. Instead of transmitting raw pixels or
plain text, UniMIC employs compact tokenized representations as the
communication medium, enabling efficient low-bitrate transmission while
maintaining compatibility with LMMs. To further enhance compression,
lightweight Transformer-based entropy models with scenario-specific designs
(generic, masked, and text-conditioned) effectively minimize inter-token
redundancy. Extensive experiments on text-to-image generation, text-guided
inpainting, outpainting, and visual question answering show that UniMIC
achieves substantial bitrate savings and remains robust even at ultra-low
bitrates (<0.05 bpp), without compromising downstream task performance. These
results establish UniMIC as a practical and forward-looking paradigm for
next-generation multimodal interactive communication.
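The core idea above, transmitting compact token indices and compressing them with a learned autoregressive entropy model, can be sketched in a few lines. This is a minimal illustrative example, not the paper's implementation: the class and function names (`TokenEntropyModel`, `estimate_bits`), the codebook size, and the model dimensions are all assumptions. It estimates the bitrate an arithmetic coder would achieve when driven by a causal Transformer's next-token probabilities.

```python
import math
import torch
import torch.nn as nn

class TokenEntropyModel(nn.Module):
    """Hypothetical sketch of a lightweight autoregressive Transformer
    entropy model over discrete visual tokens (not the UniMIC code)."""
    def __init__(self, vocab_size=1024, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each token is predicted from its predecessors only.
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)  # per-position logits over the codebook

def estimate_bits(model, tokens):
    """Total cross-entropy of the model's predictions, in bits: an estimate
    of the rate an arithmetic coder using these probabilities would reach."""
    logits = model(tokens[:, :-1])        # predict token t from tokens < t
    targets = tokens[:, 1:]
    nll = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
        reduction="sum")
    return nll.item() / math.log(2)       # nats -> bits

model = TokenEntropyModel()
tokens = torch.randint(0, 1024, (1, 64))  # e.g. an 8x8 grid of VQ indices
bits = estimate_bits(model, tokens)
print(f"~{bits:.0f} bits for 64 tokens")
```

An untrained model yields roughly uniform predictions (about log2(1024) = 10 bits per token); training the entropy model on real token statistics is what drives the rate below that baseline, which is the redundancy reduction the abstract refers to.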