Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Abstract

As research on Multimodal Large Language Models (MLLMs) becomes popular, an advanced MLLM is typically required to handle various textual and visual tasks (e.g., VQA, Detection, OCR, and ChartQA) simultaneously in real-world applications. However, because data from different tasks differ significantly in representation and distribution, simply mixing the data of all tasks together leads to the well-known "multi-task conflict" issue, resulting in performance degradation across tasks. To address this issue, we propose Awaker2.5-VL, a Mixture of Experts (MoE) architecture suitable for MLLMs, which acquires multi-task capabilities through multiple sparsely activated experts. To speed up the training and inference of Awaker2.5-VL, each expert in our model is designed as a low-rank adaptation (LoRA) structure. Extensive experiments on multiple recent benchmarks demonstrate the effectiveness of Awaker2.5-VL. The code and model weights are released on our project page: https://github.com/MetabrainAGI/Awaker.
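To make the idea of sparsely activated LoRA experts concrete, the following is a minimal sketch of one linear layer with a frozen base weight, a router, and several low-rank experts. It is not the released Awaker2.5-VL implementation: the class name `LoRAMoELinear`, the linear router, the top-k softmax gating, the random initialization, and all dimensions are assumptions chosen for illustration. The abstract states only that the experts are sparsely activated and LoRA-structured.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; real LoRA layers usually start B at zero."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAMoELinear:
    """Hypothetical layer: frozen base weight plus top-k gated LoRA experts."""

    def __init__(self, d_in, d_out, num_experts=4, rank=2, top_k=1, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.base = rand_matrix(d_out, d_in, rng)          # frozen weight W
        self.router = rand_matrix(num_experts, d_in, rng)  # assumed linear router
        # Each expert is a low-rank pair: A (rank x d_in) and B (d_out x rank).
        self.experts = [
            (rand_matrix(rank, d_in, rng), rand_matrix(d_out, rank, rng))
            for _ in range(num_experts)
        ]

    def route(self, x):
        """Return [(expert_index, gate_weight)] for the top-k experts."""
        logits = matvec(self.router, x)
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = ranked[: self.top_k]
        # Softmax over the selected logits only, so the gate weights sum to 1.
        peak = max(logits[i] for i in top)
        exps = [math.exp(logits[i] - peak) for i in top]
        total = sum(exps)
        return [(i, e / total) for i, e in zip(top, exps)]

    def forward(self, x):
        """Compute W x plus the gated low-rank updates of the selected experts."""
        y = matvec(self.base, x)
        for idx, gate in self.route(x):
            a, b = self.experts[idx]
            delta = matvec(b, matvec(a, x))  # low-rank update B (A x)
            y = [yi + gate * di for yi, di in zip(y, delta)]
        return y


layer = LoRAMoELinear(d_in=8, d_out=4, num_experts=4, rank=2, top_k=2)
x = [0.5] * 8
selected = layer.route(x)
output = layer.forward(x)
```

In this sketch each expert adds `rank * (d_in + d_out)` parameters instead of the `d_in * d_out` of a full weight matrix, and only the `top_k` selected experts are evaluated per input. Under these assumptions, that is where the parameter and compute savings would come from.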
