Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality
May 23, 2025
Authors: Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
cs.AI
Abstract
In Transformer architectures, tokens, discrete units derived from raw data,
are formed by segmenting inputs into fixed-length chunks. Each token is then
mapped to an embedding, enabling parallel attention computation while
preserving the input's essential information. Because Transformer
self-attention has computational complexity quadratic in sequence length,
token reduction has primarily been used as an efficiency strategy. This is
especially true in unimodal vision and language domains, where it helps balance
computational cost, memory usage, and inference latency. Despite these
advances, this paper argues that token reduction should transcend its
traditional efficiency-oriented role in the era of large generative models.
Instead, we position it as a fundamental principle in generative modeling,
critically influencing both model architecture and broader applications.
Specifically, we contend that across vision, language, and multimodal systems,
token reduction can: (i) facilitate deeper multimodal integration and
alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain
coherence over long inputs, and (iv) enhance training stability, among other benefits. We
reframe token reduction as more than an efficiency measure. By doing so, we
outline promising future directions, including algorithm design, reinforcement
learning-guided token reduction, token optimization for in-context learning,
and broader ML and scientific domains. We highlight its potential to drive new
model architectures and learning strategies that improve robustness, increase
interpretability, and better align with the objectives of generative modeling.
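
To make the efficiency role described in the abstract concrete, here is a minimal PyTorch sketch of score-based token pruning, keeping the top-k tokens by an attention-derived saliency score. This is a generic illustration, not any specific method from the paper; the `reduce_tokens` function, its scoring heuristic, and the keep ratio are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reduce_tokens(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative score-based token pruning (not the paper's method).

    x: (batch, num_tokens, dim) token embeddings.
    Returns a (batch, k, dim) tensor with k = int(keep_ratio * num_tokens).
    """
    b, n, d = x.shape
    k = max(1, int(n * keep_ratio))
    # Saliency proxy: average attention each token receives from all others.
    attn = F.softmax(x @ x.transpose(1, 2) / d**0.5, dim=-1)  # (b, n, n)
    scores = attn.mean(dim=1)                                  # (b, n)
    idx = scores.topk(k, dim=-1).indices                       # (b, k)
    idx = idx.sort(dim=-1).values  # restore original token order
    return x.gather(1, idx.unsqueeze(-1).expand(b, k, d))

# Self-attention cost grows with the square of sequence length, so halving
# the token count roughly quarters the attention FLOPs in later layers.
tokens = torch.randn(2, 196, 64)       # e.g., 14x14 ViT patch tokens
reduced = reduce_tokens(tokens, 0.5)   # -> shape (2, 98, 64)
```

The paper's broader argument is that choices like the scoring function here affect more than speed: which tokens survive shapes multimodal alignment, long-context coherence, and training behavior.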