Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Abstract

In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Because Transformer self-attention has quadratic computational complexity in the number of tokens, token reduction has primarily been used as an efficiency strategy. This is especially true in single-modality vision and language domains, where it helps balance computational cost, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, one that critically influences both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can, among other benefits: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability. By reframing token reduction as more than an efficiency measure, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and applications in broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
