

Aria: An Open Multimodal Native Mixture-of-Experts Model

October 8, 2024
作者: Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li
cs.AI

Abstract

Information comes in diverse modalities. Multimodal-native AI models are essential for integrating real-world information and delivering comprehensive understanding. While proprietary multimodal-native models exist, their lack of openness imposes obstacles to adoption, let alone adaptation. To fill this gap, we introduce Aria, an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-experts model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive with the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long-context handling, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoption and adaptation of Aria in real-world applications.
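The "activated parameters per token" figure reflects how a mixture-of-experts layer routes each token through only a small subset of its experts, so the compute per token is far below the total parameter count. A minimal top-k routing sketch (purely illustrative, assuming NumPy; this is not Aria's actual implementation, and all names here are hypothetical):

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:         (tokens, d) token activations
    gate_w:    (d, n_experts) router weights
    expert_ws: list of (d, d) per-expert weight matrices
    """
    logits = x @ gate_w                            # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = topk[t]
        # Softmax over the selected experts' logits only,
        # so the mixture weights for each token sum to 1.
        w = np.exp(logits[t, sel] - logits[t, sel].max())
        w /= w.sum()
        for weight, e in zip(w, sel):
            out[t] += weight * (x[t] @ expert_ws[e])
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 3
x = rng.standard_normal((tokens, d))
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_layer(x, gate_w, experts)
print(y.shape)  # (3, 8): output keeps the input shape
```

With `top_k=2` of 4 experts, each token touches only half of the expert parameters per layer, which is the mechanism behind the per-token activated-parameter counts quoted in the abstract.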

