Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs of up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with, and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source three variants of AF-Next: AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.
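
As a rough illustration of what grounding intermediate reasoning steps to timestamps could look like in practice, the Python sketch below represents a reasoning trace as a list of steps, each tied to a time span, and checks that every span falls inside the audio and within the 30-minute input limit. The `ReasoningStep` type, the `[MM:SS-MM:SS]` line format and the helper names are assumptions made for illustration only; the abstract does not specify the actual output format of AF-Next.

```python
import re
from dataclasses import dataclass
from typing import List

MAX_AUDIO_SECONDS = 30 * 60  # 30-minute input limit stated in the abstract

# Hypothetical line format (an assumption): "[MM:SS-MM:SS] free-text reasoning"
STEP_PATTERN = re.compile(r"^\[(\d+):([0-5]\d)-(\d+):([0-5]\d)\]\s*(.+)$")


@dataclass(frozen=True)
class ReasoningStep:
    start: int  # seconds from the start of the audio
    end: int    # seconds from the start of the audio
    text: str   # the reasoning tied to this span


def parse_trace(trace: str) -> List[ReasoningStep]:
    """Parse a timestamped reasoning trace (hypothetical format) into steps."""
    steps = []
    for line in trace.strip().splitlines():
        match = STEP_PATTERN.match(line.strip())
        if match is None:
            continue  # ignore lines that carry no timestamp span
        m1, s1, m2, s2, text = match.groups()
        start = int(m1) * 60 + int(s1)
        end = int(m2) * 60 + int(s2)
        steps.append(ReasoningStep(start, end, text))
    return steps


def is_grounded(steps: List[ReasoningStep], audio_seconds: int) -> bool:
    """Return True if the audio is within the limit and every span lies inside it."""
    if audio_seconds > MAX_AUDIO_SECONDS:
        return False
    return all(0 <= s.start <= s.end <= audio_seconds for s in steps)


example_trace = """
[00:05-00:20] A dog barks while traffic passes.
[12:30-13:10] A speaker announces a train delay.
"""
example_steps = parse_trace(example_trace)
```

Here `is_grounded(example_steps, 20 * 60)` accepts the trace for a 20-minute clip, while the same trace is rejected for a 10-minute clip because the second step points past the end of the audio.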
