Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Abstract

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demand for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with other modalities. We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long-speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long-speech samples, enabling Lyra to handle complex long-speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while using fewer computational resources and less training data.
