LongCat-Flash-Omni Technical Report

Abstract

We introduce LongCat-Flash-Omni, a state-of-the-art open-source omni-modal model with 560 billion parameters that excels at real-time audio-visual interaction. By adopting a curriculum-inspired progressive training strategy that moves from simpler to increasingly complex modality sequence modeling tasks, LongCat-Flash-Omni attains comprehensive multimodal capabilities while maintaining strong unimodal performance. Building upon LongCat-Flash, which adopts a high-performance Shortcut-connected Mixture-of-Experts (MoE) architecture with zero-computation experts, LongCat-Flash-Omni integrates efficient multimodal perception and speech reconstruction modules. Despite its size of 560B parameters (27B activated), LongCat-Flash-Omni achieves low-latency real-time audio-visual interaction. For training infrastructure, we developed a modality-decoupled parallelism scheme designed to manage the data and model heterogeneity inherent in large-scale multimodal training. This scheme sustains over 90% of the throughput achieved by text-only training. Extensive evaluations show that LongCat-Flash-Omni achieves state-of-the-art performance on omni-modal benchmarks among open-source models. It also delivers highly competitive results across a wide range of modality-specific tasks, including text, image, and video understanding, as well as audio understanding and generation. We provide a comprehensive overview of the model architecture design, training procedures, and data strategies, and open-source the model to foster future research and development in the community.
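To make the sparse-activation figures concrete: 27B activated out of 560B total parameters is roughly 4.8% of the model per token. The toy sketch below illustrates how an MoE layer with zero-computation experts can skip work for some tokens. It is not the LongCat-Flash implementation. The router, the scaling "experts", and every function name here are hypothetical, and modelling a zero-computation expert as an identity function is an assumption made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_ffn_expert(scale):
    # Hypothetical stand-in for a real feed-forward expert:
    # it just scales every feature of the token.
    return lambda x: [scale * v for v in x]


def zero_computation_expert(x):
    # Assumption: a zero-computation expert returns its input unchanged,
    # so selecting it adds no expert compute for this token.
    return list(x)


def moe_layer(x, router_logits, experts, top_k):
    """Route one token to its top_k experts and mix their outputs.

    Returns (output_vector, number_of_compute_bearing_experts_used).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
    out = [0.0] * len(x)
    compute_used = 0
    for i in chosen:
        weight = probs[i] / norm
        y = experts[i](x)
        out = [o + weight * v for o, v in zip(out, y)]
        if experts[i] is not zero_computation_expert:
            compute_used += 1
    return out, compute_used


# Fraction of parameters activated per token, from the figures in the abstract.
activated_fraction = 27 / 560

experts = [
    make_ffn_expert(2.0),
    make_ffn_expert(3.0),
    zero_computation_expert,
    zero_computation_expert,
]
token = [1.0, -1.0]

# A token whose router favours the zero-computation experts passes through unchanged.
easy_out, easy_cost = moe_layer(token, [0.0, 0.0, 5.0, 5.0], experts, top_k=2)
# A token whose router favours the FFN experts pays for two expert evaluations.
hard_out, hard_cost = moe_layer(token, [5.0, 5.0, 0.0, 0.0], experts, top_k=2)
```

In this sketch the number of compute-bearing experts varies per token (`easy_cost` versus `hard_cost`), which is the general idea behind spending less computation on tokens the router considers easy.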
