Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Abstract

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, which lets the model perform chain-of-thought-style reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) for inputs up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
