ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

Abstract

Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio R before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate the saved computation to match the baseline's activated FLOPs (excluding attention map computation) and total parameter count, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long-context understanding, and +0.6 points on multimodal benchmarks. When a pretrained MoE is converted during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by a factor of up to R^2 and the KV cache by a factor of up to R. At R=2, empirical measurements show prefill speedups of up to 175% and decoding speedups of up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE models, demonstrating that adaptive concept-level processing fundamentally improves both the effectiveness and the efficiency of large language models.
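The abstract does not spell out how the chunk module works, so the following is only a minimal illustrative sketch of similarity-based chunking, not the authors' method. It assumes (hypothetically) that boundaries are placed where the cosine similarity between adjacent token vectors is lowest, that the number of chunks is fixed by the target ratio R, and that each chunk is merged by mean pooling. The names `cosine`, `chunk_starts`, and `merge_to_concepts` are invented for illustration, and nothing here is learnable.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def chunk_starts(tokens: Sequence[Vector], ratio: float) -> List[int]:
    """Return the start index of each chunk for a target compression ratio.

    Assumption: cuts go where adjacent tokens are least similar.
    """
    n = len(tokens)
    if n == 0:
        return []
    n_chunks = max(1, round(n / ratio))
    # sims[i] compares token i with token i + 1.
    sims = [cosine(tokens[i], tokens[i + 1]) for i in range(n - 1)]
    # Pick the n_chunks - 1 weakest links; a cut after token i starts a chunk at i + 1.
    weakest = sorted(range(n - 1), key=lambda i: sims[i])[: n_chunks - 1]
    return sorted([0] + [i + 1 for i in weakest])


def merge_to_concepts(tokens: Sequence[Vector], starts: List[int]) -> List[List[float]]:
    """Mean-pool each chunk into one concept vector (pooling is an assumption)."""
    ends = starts[1:] + [len(tokens)]
    concepts = []
    for s, e in zip(starts, ends):
        chunk = tokens[s:e]
        dim = len(chunk[0])
        concepts.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return concepts


# Four toy token vectors: the first two are alike, as are the last two.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
starts = chunk_starts(tokens, ratio=2.0)
concepts = merge_to_concepts(tokens, starts)
```

With R = 2 the toy sequence of four tokens becomes two concepts. Because the attention map scales quadratically with sequence length and the KV cache scales linearly, halving the length is what yields the factors of R^2 = 4 and R = 2 quoted above as upper bounds.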
