Micro Language Models Enable Instant Responses

April 21, 2026
作者: Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota
cs.AI

Abstract

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B-parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-parameter models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs, with structured, graceful recovery via three error-correction methods when the local opener goes wrong. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
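The opener/continuator handoff described above can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: both model calls are stubbed with fixed strings, and all function names and the continuation prompt format are assumptions for illustration.

```python
# Hypothetical sketch of the μLM collaborative-generation handoff:
# a tiny on-device model emits the first few words immediately, and a
# cloud model is prompted as a *continuator* that extends the opener
# mid-sentence rather than answering from scratch. Model calls are stubs.

def local_opener(user_query: str) -> str:
    """Stand-in for the on-device 8M-30M μLM: decode the first 4-8 words."""
    # A real μLM would stream these tokens with near-zero latency on-device.
    return "Sure, here is a quick"

def cloud_continuation(user_query: str, opener: str) -> str:
    """Stand-in for the large cloud model, reframed as a continuator."""
    # Illustrative prompt shape: instruct the cloud model to continue the
    # partial reply without repeating or restarting it.
    prompt = (
        f"User: {user_query}\n"
        f"Assistant (continue this partial reply, do not repeat it): {opener}"
    )
    # A real system would stream tokens from the cloud model here; the
    # multi-second round trip is hidden behind the already-visible opener.
    return "summary of today's schedule."

def respond(user_query: str) -> str:
    opener = local_opener(user_query)              # shown to the user instantly
    rest = cloud_continuation(user_query, opener)  # arrives while the user reads
    return f"{opener} {rest}"

print(respond("What's on my calendar?"))
```

The error-correction paths the abstract mentions (recovering when the local opener goes wrong) would sit between the two calls, letting the cloud model repair or override a bad opener before continuing.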