SmallThinker: A Family of Efficient Large Language Models Natively Trained for Local Deployment

Abstract

While frontier large language models (LLMs) continue to push capability boundaries, their deployment remains confined to GPU-powered cloud infrastructure. We challenge this paradigm with SmallThinker, a family of LLMs natively designed, not adapted, for the unique constraints of local devices: weak computational power, limited memory, and slow storage. Unlike traditional approaches that mainly compress existing models built for the cloud, we architect SmallThinker from the ground up to thrive within these limitations. Our innovation lies in a deployment-aware architecture that transforms constraints into design principles. First, we introduce a two-level sparse structure combining fine-grained Mixture-of-Experts (MoE) with sparse feed-forward networks, drastically reducing computational demands without sacrificing model capacity. Second, to overcome the I/O bottleneck of slow storage, we design a pre-attention router that enables our co-designed inference engine to prefetch expert parameters from storage while computing attention, effectively hiding storage latency that would otherwise cripple on-device inference. Third, for memory efficiency, we use a NoPE-RoPE hybrid sparse attention mechanism to sharply reduce KV cache requirements. We release SmallThinker-4B-A0.6B and SmallThinker-21B-A3B, which achieve state-of-the-art performance scores and even outperform larger LLMs. Remarkably, our co-designed system mostly eliminates the need for expensive GPU hardware: with Q4_0 quantization, both models exceed 20 tokens/s on ordinary consumer CPUs, while consuming only 1 GB and 8 GB of memory respectively. SmallThinker is publicly available at hf.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and hf.co/PowerInfer/SmallThinker-21BA3B-Instruct.
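The prefetching idea behind the pre-attention router can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' inference engine: the expert count, top-k value, router weights, delays, and the names `route`, `load_expert`, `attention`, and `layer_forward` are all hypothetical, and storage reads and attention are simulated with `time.sleep`. The point it shows is the ordering: because the router reads the layer input before attention runs, expert loading can start early and overlap with the attention computation.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

DIM = 4           # hypothetical hidden size for this toy
NUM_EXPERTS = 8   # hypothetical expert count
TOP_K = 2         # hypothetical number of experts activated per token

# Fixed pseudo-random router weights so the sketch is deterministic.
_rng = random.Random(0)
ROUTER_W = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]


def route(hidden, k=TOP_K):
    """Score every expert from the pre-attention input and keep the top k."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in ROUTER_W]
    return sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:k]


def load_expert(expert_id, delay=0.05):
    """Stand-in for reading one expert's parameters from slow storage."""
    time.sleep(delay)
    return float(expert_id + 1)  # an "expert" is just a scale factor here


def attention(hidden, delay=0.05):
    """Stand-in for the attention computation."""
    time.sleep(delay)
    return [h * 0.5 for h in hidden]


def layer_forward(hidden, pool):
    experts = route(hidden)                                    # 1. route before attention
    pending = [pool.submit(load_expert, e) for e in experts]   # 2. start prefetching
    attn_out = attention(hidden)                               # 3. runs while I/O is in flight
    weights = [f.result() for f in pending]                    # 4. wait for any remaining I/O
    scale = sum(weights) / len(weights)                        # toy "expert" computation
    return experts, [a * scale for a in attn_out]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=TOP_K) as pool:
        start = time.perf_counter()
        experts, out = layer_forward([1.0, -2.0, 0.5, 3.0], pool)
        elapsed = time.perf_counter() - start
    print("selected experts:", experts, f"elapsed: {elapsed:.3f}s")
```

In this toy the simulated load and the simulated attention take the same time, so the elapsed time is close to one delay instead of two. How much latency is hidden in a real system depends on how storage throughput compares with attention compute time.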
