

ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

January 29, 2026
作者: Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang
cs.AI

Abstract

Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio R before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to R^2 times and the KV cache by R times. At R=2, empirical measurements show prefill speedups of up to 175% and decoding speedups of up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE models, demonstrating that adaptive concept-level processing fundamentally improves both the effectiveness and efficiency of large language models.
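
To make the token-to-concept idea concrete, below is a minimal, illustrative sketch of similarity-based merging of adjacent tokens at a target compression ratio R. This is not the authors' implementation: the greedy boundary rule, the mean-pooling of chunks, and the `merge_tokens_to_concepts` helper are assumptions introduced purely for illustration of the general technique described in the abstract.

```python
# A minimal, illustrative sketch (NOT the authors' ConceptMoE implementation) of
# similarity-based token-to-concept merging at a target compression ratio R.
# The boundary rule, pooling choice, and tensor shapes below are assumptions.

import torch
import torch.nn.functional as F


def merge_tokens_to_concepts(hidden: torch.Tensor, R: int = 2) -> torch.Tensor:
    """Greedily merge adjacent tokens into concept representations.

    hidden: (seq_len, d_model) token representations.
    Returns roughly (seq_len / R, d_model) concept representations.
    Chunk boundaries are placed where cosine similarity between neighbouring
    tokens is lowest, until the sequence is compressed by about R.
    """
    seq_len, _ = hidden.shape
    num_concepts = max(1, seq_len // R)

    # Cosine similarity between each token and its successor: shape (seq_len - 1,).
    sim = F.cosine_similarity(hidden[:-1], hidden[1:], dim=-1)

    # Keep the (num_concepts - 1) positions with the LOWEST similarity as
    # boundaries: dissimilar neighbours should not share a concept.
    boundary_idx = torch.topk(sim, k=num_concepts - 1, largest=False).indices
    boundaries = torch.zeros(seq_len, dtype=torch.bool)
    boundaries[boundary_idx + 1] = True  # a boundary opens a new chunk

    # Assign each token a chunk id, then mean-pool tokens within each chunk.
    chunk_ids = torch.cumsum(boundaries.long(), dim=0)  # (seq_len,)
    concepts = torch.zeros(int(chunk_ids.max().item()) + 1, hidden.size(-1))
    concepts.index_add_(0, chunk_ids, hidden)
    counts = torch.bincount(chunk_ids).unsqueeze(-1).clamp(min=1)
    return concepts / counts


# Example: compress 16 tokens of width 8 at R = 2 -> 8 concept vectors.
tokens = torch.randn(16, 8)
print(merge_tokens_to_concepts(tokens, R=2).shape)  # torch.Size([8, 8])
```

Because the concept model then attends over roughly seq_len / R positions, the quadratic attention cost shrinks by about R^2 and the per-token KV cache by about R, matching the scaling the abstract reports.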