LFM2 Technical Report

Abstract

We present LFM2, a family of Liquid Foundation Models designed for efficient on-device deployment and strong task capabilities. Using hardware-in-the-loop architecture search under edge latency and memory constraints, we obtain a compact hybrid backbone that combines gated short convolutions with a small number of grouped query attention blocks, delivering up to 2x faster prefill and decode on CPUs compared to similarly sized models. The LFM2 family covers 350M-8.3B parameters, including dense models (350M, 700M, 1.2B, 2.6B) and a mixture-of-experts variant (8.3B total, 1.5B active), all with 32K context length. LFM2's training pipeline includes a tempered, decoupled Top-K knowledge distillation objective that avoids support mismatch; curriculum learning with difficulty-ordered data; and a three-stage post-training recipe of supervised fine-tuning, length-normalized preference optimization, and model merging. Pre-trained on 10-12T tokens, LFM2 models achieve strong results across diverse benchmarks; for example, LFM2-2.6B reaches 79.56% on IFEval and 82.41% on GSM8K. We further build multimodal and retrieval variants: LFM2-VL for vision-language tasks, LFM2-Audio for speech, and LFM2-ColBERT for retrieval. LFM2-VL supports tunable accuracy-latency tradeoffs via token-efficient visual processing, while LFM2-Audio separates audio input and output pathways to enable real-time speech-to-speech interaction competitive with models 3x larger. LFM2-ColBERT provides a low-latency encoder for queries and documents, enabling high-performance retrieval across multiple languages. All models are released with open weights and deployment packages for ExecuTorch, llama.cpp, and vLLM, making LFM2 a practical base for edge applications that need fast, memory-efficient inference and strong task capabilities.
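The abstract names gated short convolutions as the main building block of the backbone but does not spell out the operator. The following is a minimal single-channel sketch of the general idea, assuming a causal short convolution wrapped in multiplicative input and output gates. The function name `gated_short_conv` and its parameters are hypothetical, and the gates are passed in as precomputed values here, whereas a real model would presumably derive them from learned projections of the input.

```python
from typing import List, Sequence


def gated_short_conv(
    x: Sequence[float],
    in_gate: Sequence[float],
    out_gate: Sequence[float],
    kernel: Sequence[float],
) -> List[float]:
    """Illustrative gated causal short convolution for one channel.

    Assumed form (not taken from the report):
        y[t] = out_gate[t] * sum_j kernel[j] * (in_gate[t-j] * x[t-j])
    with terms where t - j < 0 treated as zero (causal padding).
    """
    if not (len(x) == len(in_gate) == len(out_gate)):
        raise ValueError("x, in_gate and out_gate must have the same length")

    # Input gate: elementwise modulation before the convolution.
    gated = [g * v for g, v in zip(in_gate, x)]

    out: List[float] = []
    for t in range(len(gated)):
        acc = 0.0
        # Short kernel: each output sees only the last len(kernel) positions.
        for j, w in enumerate(kernel):
            if t - j >= 0:
                acc += w * gated[t - j]
        # Output gate: elementwise modulation after the convolution.
        out.append(out_gate[t] * acc)
    return out


if __name__ == "__main__":
    ones = [1.0, 1.0, 1.0, 1.0]
    print(gated_short_conv([1.0, 2.0, 3.0, 4.0], ones, ones, [1.0, 1.0, 1.0]))
```

Because each output depends only on a fixed, small window of past positions, the per-token cost and the cached state stay constant during decoding. That fits the abstract's emphasis on CPU latency and memory, although the report's actual operator may differ from this sketch.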