Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Abstract

We present Audio Flamingo 3 (AF3), a fully open, state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained with a novel strategy for joint representation learning across all three modalities of speech, sound, and music; (ii) flexible, on-demand thinking, which lets the model perform chain-of-thought-style reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long-audio understanding and reasoning (including speech) for inputs of up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated with novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and we train AF3 with a novel five-stage curriculum-based training strategy. Trained only on open-source audio data, AF3 achieves new SOTA results on more than 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.