Voxtral

Abstract

We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks while preserving strong text capabilities. Voxtral Small outperforms a number of closed-source models while remaining small enough to run locally. A 32K context window enables the models to handle audio files up to 40 minutes long, as well as long multi-turn conversations. We also contribute three benchmarks for evaluating speech understanding models on knowledge and trivia. Both Voxtral models are released under the Apache 2.0 license.
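
The relationship between the 32K context window and the 40-minute audio limit can be checked with some rough arithmetic. The sketch below is illustrative only. It assumes that "32K" means 32 × 1024 tokens, and the rate of 12.5 audio tokens per second in the usage lines is a hypothetical value that the abstract does not state. The function names are also invented for this example.

```python
import math

# Assumption: "32K" is read as 32 * 1024 tokens; the abstract gives no exact figure.
CONTEXT_TOKENS = 32 * 1024
# The abstract states audio of up to 40 minutes.
MAX_AUDIO_SECONDS = 40 * 60


def max_tokens_per_second(context_tokens: int, audio_seconds: int) -> float:
    """Upper bound on the audio token rate if the whole clip must fit in context."""
    return context_tokens / audio_seconds


def audio_tokens(duration_seconds: float, tokens_per_second: float) -> int:
    """Tokens consumed by a clip at a given (assumed) audio token rate."""
    return math.ceil(duration_seconds * tokens_per_second)


def remaining_text_tokens(duration_seconds: float, tokens_per_second: float,
                          context_tokens: int = CONTEXT_TOKENS) -> int:
    """Context left over for prompts and responses after the audio is encoded."""
    return context_tokens - audio_tokens(duration_seconds, tokens_per_second)


if __name__ == "__main__":
    bound = max_tokens_per_second(CONTEXT_TOKENS, MAX_AUDIO_SECONDS)
    print(f"Implied upper bound: {bound:.2f} audio tokens per second")

    # Hypothetical rate, used only to illustrate the budget split.
    assumed_rate = 12.5
    print("Audio tokens for 40 min:", audio_tokens(MAX_AUDIO_SECONDS, assumed_rate))
    print("Tokens left for text:", remaining_text_tokens(MAX_AUDIO_SECONDS, assumed_rate))
```

Under these assumptions, a 40-minute clip fits only if the audio is encoded at fewer than about 14 tokens per second. Any rate below that bound leaves part of the window free for the text prompt and the model's reply.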