ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

January 29, 2026
Authors: Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang
cs.AI

Abstract

Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio R before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate the saved computation to match the baseline's activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long-context understanding, and +0.6 points on multimodal benchmarks. When converting a pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to R^2 times and the KV cache by R times. At R = 2, empirical measurements show prefill speedups of up to 175% and decoding speedups of up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE systems, demonstrating that adaptive concept-level processing fundamentally improves both the effectiveness and efficiency of large language models.
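
To make the compression step concrete, below is a minimal PyTorch sketch of similarity-based token-to-concept merging at a target ratio R. It is an illustration under stated assumptions, not the paper's implementation: the paper describes a learnable chunk module, whereas this stand-in simply places chunk boundaries at the least-similar adjacent-token pairs (by cosine similarity) and mean-pools each chunk into one concept vector; the function name chunk_and_merge and the pooling choice are hypothetical.

```python
# Illustrative sketch only (assumed names; not the authors' code).
# Cuts a token sequence at low-similarity adjacent pairs to reach a target
# compression ratio R, then mean-pools each chunk into a "concept" vector.
import torch
import torch.nn.functional as F


def chunk_and_merge(hidden: torch.Tensor, R: int = 2) -> torch.Tensor:
    """hidden: (batch, seq_len, dim) token states -> (batch, seq_len // R, dim)."""
    B, T, D = hidden.shape
    num_chunks = max(1, T // R)          # target number of concepts for ratio R

    # Cosine similarity between each token and its successor: (B, T - 1).
    sim = F.cosine_similarity(hidden[:, :-1], hidden[:, 1:], dim=-1)

    # Start a new chunk after the (num_chunks - 1) least-similar adjacent pairs,
    # i.e. where the underlying "concept" most likely changes.
    boundary = torch.zeros(B, T, dtype=torch.long)
    if num_chunks > 1:
        cuts = torch.topk(sim, num_chunks - 1, dim=-1, largest=False).indices
        boundary.scatter_(1, cuts + 1, 1)
    boundary[:, 0] = 1                   # every sequence starts a new chunk

    # Assign each token a chunk id, then mean-pool tokens sharing an id.
    chunk_id = boundary.cumsum(dim=1) - 1            # (B, T), values in [0, num_chunks)
    concepts = torch.zeros(B, num_chunks, D)
    counts = torch.zeros(B, num_chunks, 1)
    concepts.scatter_add_(1, chunk_id.unsqueeze(-1).expand(-1, -1, D), hidden)
    counts.scatter_add_(1, chunk_id.unsqueeze(-1), torch.ones(B, T, 1))
    return concepts / counts.clamp(min=1)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    print(chunk_and_merge(x, R=2).shape)  # torch.Size([2, 8, 64])
```

After merging, the concept model attends over roughly T/R positions instead of T, which is where the up-to-R^2 reduction in attention computation and the R-times smaller KV cache reported in the abstract come from.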