Voxtral实时系统
Voxtral Realtime
February 11, 2026
作者: Alexander H. Liu, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Sandeep Subramanian, Soham Ghosh, Srijan Mishra, Abhinav Rastogi, Alan Jeffares, Albert Jiang, Alexandre Sablayrolles, Amélie Héliou, Andrew Bai, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elliot Chane-Sane, Enguerrand Paquin, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Martin, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Indraneel Mukherjee, Irene Zhang, Jaeyoung Kim, Jan Ludziejewski, Jason Rute, Joachim Studnia, John Harvill, Jonas Amar, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kush Jain, Laurence Aitchison, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Margaret Jennings, Marie Pellat, Mark Prins, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Sagar Vaze, Samuel Humeau, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Valeriia Nemychnikova, Van Phung, Vedant Nanda, Victor Jouault, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yihan Wang, Zaccharie Ramzi, Zhenlin Xu
cs.AI
摘要
我们推出Voxtral Realtime,这是一款原生流式自动语音识别模型,在亚秒级延迟下即可达到离线转录的同等质量。与通过分块或滑动窗口适配离线模型的方法不同,Voxtral Realtime采用端到端的流式训练方式,实现了音频流与文本流的显式对齐。我们的架构基于延迟流建模框架,引入了新型因果音频编码器和自适应RMS归一化技术以优化延迟调节。通过覆盖13种语言的大规模数据集进行预训练,在480毫秒延迟条件下,Voxtral Realtime的表现与最广泛应用的离线转录系统Whisper持平。本模型权重已基于Apache 2.0许可证开源发布。
English
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.