When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
July 27, 2025
Authors: Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
cs.AI
Abstract
Multimodal large language models (MLLMs) have made remarkable strides,
largely driven by their ability to process increasingly long and complex
contexts, such as high-resolution images, extended video sequences, and lengthy
audio inputs. While this ability significantly enhances MLLM capabilities, it
introduces substantial computational challenges, primarily due to the quadratic
complexity of self-attention with respect to the number of input tokens. To mitigate
these bottlenecks, token compression has emerged as a promising and critical
approach, efficiently reducing the number of tokens during both training and
inference. In this paper, we present the first systematic survey and synthesis
of the burgeoning field of multimodal long-context token compression.
Recognizing that effective compression strategies are deeply tied to the unique
characteristics and redundancies of each modality, we categorize existing
approaches by their primary data focus, enabling researchers to quickly access
and learn methods tailored to their specific area of interest: (1)
image-centric compression, which addresses spatial redundancy in visual data;
(2) video-centric compression, which tackles spatio-temporal redundancy in
dynamic sequences; and (3) audio-centric compression, which handles temporal
and spectral redundancy in acoustic signals. Beyond this modality-driven
categorization, we further dissect methods based on their underlying
mechanisms, including transformation-based, similarity-based, attention-based,
and query-based approaches. By providing a comprehensive and structured
overview, this survey aims to consolidate current progress, identify key
challenges, and inspire future research directions in this rapidly evolving
domain. We also maintain a public repository to continuously track and update
the latest advances in this promising area.
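
The survey groups compression mechanisms into transformation-, similarity-, attention-, and query-based families. As a rough intuition for the similarity-based family only, the PyTorch sketch below merges the most mutually redundant tokens of a sequence via cosine similarity and averaging. The function name merge_similar_tokens, the keep_ratio parameter, and the greedy merging heuristic are illustrative assumptions of this write-up, not a method taken from any surveyed paper.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Toy similarity-based compression of an (N, D) token sequence.

    The most redundant tokens (those with a highly similar neighbor) are
    averaged into their nearest kept token, shrinking the sequence to
    roughly keep_ratio * N tokens.
    """
    n, _ = tokens.shape
    n_keep = max(1, int(n * keep_ratio))

    # Pairwise cosine similarity between all tokens.
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.T
    sim.fill_diagonal_(float("-inf"))  # a token should not match itself

    # Tokens whose best match is *least* similar are the most distinctive: keep them.
    redundancy = sim.max(dim=-1).values   # (N,)
    order = redundancy.argsort()          # ascending redundancy
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]

    # Merge every dropped token into its most similar kept token (simple average).
    merged = tokens[keep_idx].clone()
    counts = torch.ones(n_keep, 1, dtype=tokens.dtype)
    nearest_kept = sim[drop_idx][:, keep_idx].argmax(dim=-1)  # (N - n_keep,)
    for d, t in zip(drop_idx.tolist(), nearest_kept.tolist()):
        merged[t] += tokens[d]
        counts[t] += 1
    return merged / counts


# Example: compress 196 image-patch tokens of width 768 down to ~98 tokens.
if __name__ == "__main__":
    x = torch.randn(196, 768)
    compressed = merge_similar_tokens(x, keep_ratio=0.5)
    print(compressed.shape)  # torch.Size([98, 768])
```

Because self-attention cost grows quadratically with sequence length, halving the token count in this way cuts the attention FLOPs to roughly a quarter, which is the efficiency motivation stated in the abstract.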