SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment
July 28, 2025
Authors: Yixin Song, Zhenliang Xue, Dongliang Wei, Feiyang Chen, Jianxiang Gao, Junchen Liu, Hangyu Liang, Guangshuo Qin, Chengrong Tian, Bo Wen, Longyu Zhao, Xinrui Zheng, Zeyu Mi, Haibo Chen
cs.AI
Abstract
While frontier large language models (LLMs) continue to push capability
boundaries, their deployment remains confined to GPU-powered cloud
infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs
natively designed - not adapted - for the unique constraints of local devices:
weak computational power, limited memory, and slow storage. Unlike traditional
approaches that mainly compress existing models built for clouds, we architect
SmallThinker from the ground up to thrive within these limitations. Our
innovation lies in a deployment-aware architecture that transforms constraints
into design principles. First, we introduce a two-level sparse structure
combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward
networks, drastically reducing computational demands without sacrificing model
capacity. Second, to conquer the I/O bottleneck of slow storage, we design a
pre-attention router that enables our co-designed inference engine to prefetch
expert parameters from storage while computing attention, effectively hiding
storage latency that would otherwise cripple on-device inference. Third, for
memory efficiency, we utilize a NoPE-RoPE hybrid sparse attention mechanism to
slash KV cache requirements. We release SmallThinker-4B-A0.6B and
SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and
even outperform larger LLMs. Remarkably, our co-designed system largely
eliminates the need for expensive GPU hardware: with Q4_0 quantization, both
models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1GB
and 8GB of memory respectively. SmallThinker is publicly available at
hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and
hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
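
To make the pre-attention routing idea concrete, here is a minimal PyTorch-style sketch, not the authors' implementation: the router is evaluated on the block's input before attention, so the weights of the selected experts can be prefetched from slow storage in a background thread while attention computes. The `PreAttentionBlock` class, the `expert_store.load` call, and the thread-based prefetcher are illustrative assumptions; layer norms and top-k gate renormalization are omitted for brevity.

```python
# Hypothetical sketch of pre-attention routing with expert prefetching.
# Assumption: `expert_store.load(ids)` loads the listed experts' weights
# from disk; it is a placeholder, not an API from the paper.
from concurrent.futures import ThreadPoolExecutor

import torch
import torch.nn as nn


class PreAttentionBlock(nn.Module):
    def __init__(self, dim, n_experts, top_k, attention, experts, expert_store):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)  # pre-attention router
        self.top_k = top_k
        self.attention = attention          # any attention module: x -> tensor of same shape
        self.experts = experts              # nn.ModuleList of expert FFNs
        self.expert_store = expert_store    # hypothetical loader for expert weights on disk
        self._pool = ThreadPoolExecutor(max_workers=1)

    def forward(self, x):
        # 1) Route on the block input (before attention) to learn which
        #    experts this batch of tokens will need.
        logits = self.router(x)                                   # [B, T, n_experts]
        needed = logits.topk(self.top_k, dim=-1).indices.unique().tolist()

        # 2) Start prefetching those experts from storage in the background,
        #    overlapping I/O with the attention computation below.
        prefetch = self._pool.submit(self.expert_store.load, needed)

        # 3) Compute attention while the prefetch is in flight.
        h = x + self.attention(x)

        # 4) Wait for the weights, then run the sparse MoE feed-forward.
        prefetch.result()
        gates = torch.softmax(logits, dim=-1)
        topk = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(h)
        for k in range(self.top_k):
            for e in topk.indices[..., k].unique().tolist():
                mask = topk.indices[..., k] == e
                out[mask] += topk.values[..., k][mask].unsqueeze(-1) * self.experts[e](h[mask])
        return h + out
```

The design point this sketch illustrates is the overlap itself: because routing no longer depends on the attention output, the expert I/O runs concurrently with attention compute, which is how the co-designed engine hides storage latency on devices without enough memory to keep all experts resident.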