
Shifting AI Efficiency From Model-Centric to Data-Centric Compression

May 25, 2025
作者: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
cs.AI

Abstract
The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling, increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, we argue that the research focus for efficient AI is shifting from model-centric compression to data-centric compression. We position token compression as the new frontier: it improves AI efficiency by reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.
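The quadratic-cost argument above can be sketched numerically. The following is a minimal illustration, not a measurement of any particular model: the layer width, sequence lengths, and compression ratio are hypothetical, and the FLOP formula keeps only the two n²·d terms of self-attention (score computation and value aggregation), omitting the linear projection costs.

```python
def attention_flops(n_tokens: int, d_model: int) -> int:
    """Rough FLOP count for one self-attention layer:
    the QK^T score matrix (~n^2 * d) plus the attention-weighted
    sum of values (~n^2 * d). Per-token projections, which scale
    only linearly in n, are omitted to isolate the quadratic term."""
    return 2 * n_tokens * n_tokens * d_model

# Hypothetical sizes for illustration only.
full = attention_flops(n_tokens=8192, d_model=4096)
compressed = attention_flops(n_tokens=2048, d_model=4096)  # keep 1 in 4 tokens

# Because the cost is quadratic in sequence length, a 4x token
# reduction shrinks the attention term by 16x.
print(full // compressed)  # 16
```

This is why data-centric compression can outpace model-centric compression for long contexts: halving the parameter count roughly halves compute, while halving the token count quarters the dominant attention term.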
