Aria: An Open Multimodal Native Mixture-of-Experts Model
October 8, 2024
Authors: Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li
cs.AI
Abstract
Information comes in diverse modalities. Multimodal native AI models are
essential to integrate real-world information and deliver comprehensive
understanding. While proprietary multimodal native models exist, their lack of
openness imposes obstacles to adoption, let alone adaptation. To fill this
gap, we introduce Aria, an open multimodal native model with best-in-class
performance across a wide range of multimodal, language, and coding tasks. Aria
is a mixture-of-experts model with 3.9B and 3.5B activated parameters per visual
token and text token, respectively. It outperforms Pixtral-12B and
Llama3.2-11B, and is competitive against the best proprietary models on various
multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline,
which progressively equips the model with strong capabilities in language
understanding, multimodal understanding, long context window, and instruction
following. We open-source the model weights along with a codebase that
facilitates easy adoption and adaptation of Aria in real-world applications.
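The abstract's figure of "3.9B and 3.5B activated parameters per visual token and text token" reflects how a mixture-of-experts layer works: a router selects only a few experts per token, so each token exercises a small fraction of the model's total parameters. The toy sketch below illustrates this routing idea with top-k expert selection; all sizes and names here are illustrative assumptions, not Aria's actual architecture or code.

```python
import numpy as np

# Toy mixture-of-experts (MoE) layer: each token activates only its top-k
# experts, so the "activated parameters" per token are a fraction of the
# total. Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # total experts in the layer (assumed value)
TOP_K = 2         # experts activated per token (assumed value)
D = 16            # hidden dimension (assumed value)

# Each expert is a simple linear map; a router scores experts per token.
experts = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D, NUM_EXPERTS)) / np.sqrt(D)

def moe_forward(x):
    """Route each token to its top-k experts and mix their outputs."""
    scores = x @ router                              # (tokens, experts)
    topk = np.argsort(scores, axis=-1)[:, -TOP_K:]   # top-k expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = scores[t, topk[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                     # softmax over selected experts
        for w, e in zip(weights, topk[t]):
            out[t] += w * (x[t] @ experts[e])        # only k experts run per token
    return out, topk

tokens = rng.standard_normal((4, D))
y, routing = moe_forward(tokens)
# Only TOP_K of NUM_EXPERTS experts contribute to each token's output,
# so the activated-parameter count is TOP_K / NUM_EXPERTS of the expert total.
activated_fraction = TOP_K / NUM_EXPERTS
```

Because the router can pick different experts for visual and text tokens, the activated-parameter count can differ by modality, which is consistent with the abstract quoting separate figures for the two token types.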