When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
July 27, 2025
Authors: Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
cs.AI
Abstract
Multimodal large language models (MLLMs) have made remarkable strides,
largely driven by their ability to process increasingly long and complex
contexts, such as high-resolution images, extended video sequences, and lengthy
audio inputs. While this ability significantly enhances MLLM capabilities, it
introduces substantial computational challenges, primarily due to the quadratic
complexity of self-attention mechanisms with numerous input tokens. To mitigate
these bottlenecks, token compression has emerged as a promising and critical
approach, efficiently reducing the number of tokens during both training and
inference. In this paper, we present the first systematic survey and synthesis
of the burgeoning field of multimodal long-context token compression.
Recognizing that effective compression strategies are deeply tied to the unique
characteristics and redundancies of each modality, we categorize existing
approaches by their primary data focus, enabling researchers to quickly access
and learn methods tailored to their specific area of interest: (1)
image-centric compression, which addresses spatial redundancy in visual data;
(2) video-centric compression, which tackles spatio-temporal redundancy in
dynamic sequences; and (3) audio-centric compression, which handles temporal
and spectral redundancy in acoustic signals. Beyond this modality-driven
categorization, we further dissect methods based on their underlying
mechanisms, including transformation-based, similarity-based, attention-based,
and query-based approaches. By providing a comprehensive and structured
overview, this survey aims to consolidate current progress, identify key
challenges, and inspire future research directions in this rapidly evolving
domain. We also maintain a public repository to continuously track and update
the latest advances in this promising area.
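To make the similarity-based mechanism concrete, the sketch below implements a single ToMe-style bipartite soft-matching step in NumPy: tokens are split into two sets, each token in the first set is matched to its most cosine-similar partner in the second, and the `r` most redundant tokens are averaged into their matches. The function name, the fixed even/odd split, and the running-average merge are illustrative assumptions, not a specific method from this survey.

```python
import numpy as np

def bipartite_token_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """One similarity-based merging step: reduce N tokens to N - r.

    tokens: (N, D) array of token embeddings.
    r: number of tokens to remove by merging (r <= ceil(N / 2)).
    """
    # Split tokens into two disjoint sets (here: even vs. odd positions).
    a, b = tokens[0::2], tokens[1::2]
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T                        # cosine similarity, shape (|a|, |b|)
    best_b = sim.argmax(axis=1)            # most similar b-token for each a-token
    best_sim = sim.max(axis=1)
    merge_idx = np.argsort(-best_sim)[:r]  # the r most redundant a-tokens
    merged = b.copy()
    counts = np.ones(len(b))
    for i in merge_idx:                    # average each merged a-token into its match
        j = best_b[i]
        merged[j] = (merged[j] * counts[j] + a[i]) / (counts[j] + 1)
        counts[j] += 1
    keep_a = np.delete(a, merge_idx, axis=0)
    return np.concatenate([keep_a, merged], axis=0)
```

Because merging replaces near-duplicate tokens with a single averaged representative, repeating this step across layers shortens the sequence the quadratic self-attention must process while preserving most of its information.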