When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

Abstract

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio inputs. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily because the cost of self-attention grows quadratically with the number of input tokens. To mitigate these bottlenecks, token compression has emerged as a promising and critical approach that efficiently reduces the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly find and learn the methods relevant to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.

