Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Abstract

In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Because Transformer self-attention has quadratic computational complexity in the number of tokens, token reduction has primarily been used as an efficiency strategy. This is especially true in the single-modality vision and language domains, where it helps balance computational cost, memory usage, and inference latency. Despite these advances, this paper argues that, in the era of large generative models, token reduction should transcend its traditional efficiency-oriented role. Instead, we position it as a fundamental principle of generative modeling that critically influences both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can, among other benefits: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability. By reframing token reduction as more than an efficiency measure, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, and applications to broader ML and scientific domains. We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.
