UniMIC: Token-Based Multimodal Interactive Coding for Human-AI Collaboration

Abstract

The rapid progress of Large Multimodal Models (LMMs) and cloud-based AI agents is transforming human-AI collaboration into bidirectional, multimodal interaction. However, existing codecs remain optimized for unimodal, one-way communication, so quality degrades repeatedly under conventional compress-transmit-reconstruct pipelines. To address this limitation, we propose UniMIC, a Unified token-based Multimodal Interactive Coding framework that bridges edge devices and cloud AI agents. Instead of transmitting raw pixels or plain text, UniMIC employs compact tokenized representations as the communication medium, enabling efficient low-bitrate transmission while maintaining compatibility with LMMs. To further enhance compression, lightweight Transformer-based entropy models with scenario-specific designs (generic, masked, and text-conditioned) minimize inter-token redundancy. Extensive experiments on text-to-image generation, text-guided inpainting, outpainting, and visual question answering show that UniMIC achieves substantial bitrate savings and remains robust even at ultra-low bitrates (<0.05 bpp), without compromising downstream task performance. These results establish UniMIC as a practical and forward-looking paradigm for next-generation multimodal interactive communication.
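
To make the bitrate claim concrete, the following is a minimal sketch of how bits per pixel (bpp) could be estimated when token indices, rather than pixels, are transmitted. It is not the authors' implementation: the image size, token grid, codebook size, and the uniform per-token probability are all hypothetical values chosen for illustration, and the function names are invented here. The sketch compares a fixed-length code for each token index against the ideal code length under probabilities predicted by an entropy model.

```python
import math

def fixed_length_bpp(num_tokens, codebook_size, width, height):
    """bpp when every token index is sent with a fixed-length code."""
    return num_tokens * math.log2(codebook_size) / (width * height)

def entropy_coded_bpp(token_probs, width, height):
    """Ideal bpp when each token is coded under the probability an
    entropy model assigned to it (code length = -log2 p)."""
    total_bits = sum(-math.log2(p) for p in token_probs)
    return total_bits / (width * height)

# Hypothetical setup (not from the paper): a 512x512 image represented
# as a 32x32 grid of tokens drawn from a codebook of 8192 entries.
width, height = 512, 512
num_tokens = 32 * 32
codebook_size = 8192

baseline = fixed_length_bpp(num_tokens, codebook_size, width, height)

# Assumption: the entropy model assigns probability 1/64 to each token
# that actually occurs, i.e. 6 bits per token instead of 13.
coded = entropy_coded_bpp([1 / 64] * num_tokens, width, height)

print(f"fixed-length: {baseline:.4f} bpp")   # 13 bits/token -> ~0.0508 bpp
print(f"entropy-coded: {coded:.4f} bpp")     # 6 bits/token  -> ~0.0234 bpp
```

Under these assumed numbers, the fixed-length baseline already sits near 0.05 bpp, and a sharper predicted distribution lowers it further. This is the role the abstract assigns to the Transformer-based entropy models.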
