Aria: An Open Multimodal Native Mixture-of-Experts Model

Abstract

Information comes in diverse modalities. Multimodal native AI models are essential for integrating real-world information and delivering comprehensive understanding. While proprietary multimodal native models exist, their lack of openness is an obstacle to adoption, let alone adaptation. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-experts model with 3.9B activated parameters per visual token and 3.5B per text token. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive with the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context windows, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoption and adaptation of Aria in real-world applications.
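To make the phrase "activated parameters per token" concrete, the following is a minimal sketch of a generic top-k mixture-of-experts layer. It is not Aria's implementation: the abstract does not specify the router type, the number of experts, or the value of k, so `top_k_route`, `moe_layer`, `param_counts`, `make_expert`, and all sizes below are hypothetical choices for illustration only.

```python
import math


def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k largest router logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    Top-k softmax routing is an assumption here, not a detail from the abstract.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def make_expert(scale):
    """Hypothetical stand-in for an expert network: scales its input."""
    return lambda x: [scale * v for v in x]


def moe_layer(x, router, experts, k):
    """Apply a toy MoE layer to one token vector x (a list of floats).

    router holds one weight row per expert; only k experts run per token.
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in router]
    out = [0.0] * len(x)
    for idx, weight in top_k_route(logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def param_counts(shared, per_expert, n_experts, k):
    """Return (total, activated-per-token) parameter counts for this toy layout."""
    return shared + n_experts * per_expert, shared + k * per_expert


if __name__ == "__main__":
    experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]
    router = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # all-zero router: ties broken by index
    print(moe_layer([1.0, 2.0], router, experts, k=2))
    print(param_counts(shared=10, per_expert=5, n_experts=8, k=2))
```

The point of `param_counts` is only that a token touches the shared weights plus the k experts it is routed to, which is why a model's activated parameter count can be far smaller than its total. The toy numbers are not derived from the 3.9B and 3.5B figures in the abstract.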
