Micro Language Models Enable Instant Responses

Abstract

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device while a cloud model completes it, thereby masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and, when the local opener goes wrong, structured graceful recovery via three error correction methods. Empirical results show that μLMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
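The handoff described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `collaborative_reply`, `truncate_opener`, `HandoffResult`, and the two toy model functions are hypothetical stand-ins, and the three error correction methods are not modeled because the abstract does not specify them. The sketch only shows the control flow: the on-device model produces a short opener that can be shown immediately, and the cloud model is asked to continue that opener rather than answer from scratch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HandoffResult:
    opener: str        # words produced on-device (hypothetical μLM output)
    continuation: str  # words produced by the cloud model

    @property
    def text(self) -> str:
        # Join both halves into the sentence the user finally sees.
        return f"{self.opener} {self.continuation}".strip()


def truncate_opener(text: str, max_words: int = 8) -> str:
    """Keep at most max_words whitespace-separated words."""
    return " ".join(text.split()[:max_words])


def collaborative_reply(
    prompt: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str, str], str],
    max_opener_words: int = 8,
    on_opener: Optional[Callable[[str], None]] = None,
) -> HandoffResult:
    # Step 1: the on-device model drafts the first few words.
    opener = truncate_opener(local_model(prompt), max_opener_words)
    # Step 2: surface the opener right away, before the cloud call returns.
    if on_opener is not None:
        on_opener(opener)
    # Step 3: the cloud model acts as a continuator of the opener.
    continuation = cloud_model(prompt, opener)
    return HandoffResult(opener, continuation)


# Toy stand-ins for the two models (assumptions, for illustration only).
def toy_local_model(prompt: str) -> str:
    return "It looks like it will be sunny today and also rather warm outside"


def toy_cloud_model(prompt: str, opener: str) -> str:
    return "with a light breeze in the afternoon."


if __name__ == "__main__":
    result = collaborative_reply(
        "What is the weather?", toy_local_model, toy_cloud_model, on_opener=print
    )
    print(result.text)
```

In a real deployment the `on_opener` callback would presumably drive a display or speech output, so the user perceives a response while the cloud request is still in flight.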
