Micro Language Models Enable Instant Responses
April 21, 2026
Authors: Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota
cs.AI
Abstract
Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing models in the 70M-256M class. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured, graceful recovery via three error-correction methods when the local opener goes wrong. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
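The latency-masking handoff described above can be sketched as follows. This is a minimal illustrative stub, not the paper's implementation: both "models" are placeholder functions (`micro_lm_opener`, `cloud_continuation` are hypothetical names), and the cloud round-trip is simulated with a sleep. The point is the control flow: the opener is emitted instantly on-device while the cloud continuator works in the background, so the user never waits on the network before seeing text.

```python
import threading
import time

def micro_lm_opener(prompt: str) -> str:
    """Stand-in for the on-device muLM: instantly returns a short,
    contextually grounded 4-8 word opener."""
    return "Sure, the weather today looks"

def cloud_continuation(prompt: str, opener: str, result: dict) -> None:
    """Stand-in for the cloud model, reframed as a *continuator*: it
    receives the locally generated opener and must pick up mid-sentence."""
    time.sleep(0.05)  # simulated network + inference latency
    result["text"] = " mostly sunny with a light breeze."

def respond(prompt: str) -> str:
    # 1. The muLM generates the opener on-device with no network round-trip.
    opener = micro_lm_opener(prompt)
    # 2. Ship the prompt plus opener to the cloud continuator in the background...
    result: dict = {}
    worker = threading.Thread(
        target=cloud_continuation, args=(prompt, opener, result)
    )
    worker.start()
    # 3. ...while the opener is already being streamed to the user,
    #    masking the cloud latency. Here we just wait and splice.
    worker.join()
    return opener + result["text"]

print(respond("What's the weather?"))
```

In a real deployment the opener would be streamed token by token and the splice point validated (the abstract mentions three error-correction methods for when the local opener goes wrong, which this sketch omits).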