

When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

July 27, 2025
作者: Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
cs.AI

Abstract

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio inputs. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention over large numbers of input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach that efficiently reduces the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods by their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
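To make the similarity-based mechanism family concrete, the sketch below implements a toy compressor that greedily averages the most cosine-similar pair of token embeddings until a target count is reached. This is a minimal, O(n²)-per-step illustration under assumed names (`merge_similar_tokens`, `keep`); it is not the algorithm of any specific surveyed method, which typically use far more efficient matching schemes.

```python
import numpy as np

def merge_similar_tokens(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Greedily merge the most cosine-similar pair of token embeddings
    (rows of `tokens`) until only `keep` tokens remain. A toy sketch of
    similarity-based token compression, not a production method."""
    tokens = tokens.astype(float).copy()
    while len(tokens) > keep:
        # Cosine similarity between all token pairs.
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        # Find and average the most similar pair.
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2.0
        tokens = np.vstack([np.delete(tokens, [i, j], axis=0), merged])
    return tokens

# 16 random "visual tokens" of dimension 8, compressed to 4.
rng = np.random.default_rng(0)
compressed = merge_similar_tokens(rng.normal(size=(16, 8)), keep=4)
print(compressed.shape)  # (4, 8)
```

Because self-attention cost scales quadratically with sequence length, reducing 16 tokens to 4 here would cut attention FLOPs by roughly 16x, which is the core motivation the survey describes.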