AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
October 13, 2025
Authors: Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu
cs.AI
Abstract
In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o,
Gemini, and Claude Sonnet have demonstrated outstanding performance with
enormous model sizes reaching hundreds of billions of parameters, their
demands far exceed the memory, power, and compute budgets of edge
devices such as mobile phones. This paper introduces
AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on
Qwen3's LLM and various visual encoders. We comprehensively outline the model
architectures, training pipeline, and training data of AndesVL, which achieves
first-tier performance across a wide range of open-source benchmarks, including
fields such as text-rich image understanding, reasoning and math, multi-image
comprehension, general VQA, hallucination mitigation, multilingual
understanding, and GUI-related tasks when compared with state-of-the-art models
of a similar scale. Furthermore, we introduce a 1+N LoRA
(Low-Rank Adaptation) technique.
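The details of the 1+N LoRA scheme are not given in this abstract. As general background only, a plain LoRA layer keeps the pretrained weight frozen and learns a low-rank additive update. The NumPy sketch below illustrates this standard construction; all names, dimensions, and hyperparameter values are illustrative and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (generic LoRA).

    Effective weight: W + (alpha / r) * B @ A, where A is (r, d_in) and
    B is (d_out, r). B starts at zero, so the layer initially reproduces
    the frozen base model exactly.
    """
    def __init__(self, d_in, d_out, r=4, alpha=8):
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((r, d_in)) / np.sqrt(d_in)      # trainable, rank r
        self.B = np.zeros((d_out, r))                                # trainable, zero-init
        self.scale = alpha / r

    def forward(self, x):
        base = x @ self.W.T                            # frozen path
        delta = (x @ self.A.T) @ self.B.T * self.scale  # low-rank adapter path
        return base + delta

layer = LoRALinear(d_in=64, d_out=32)
x = rng.standard_normal((2, 64))
y = layer.forward(x)
# With B zero-initialized, the adapter contributes nothing yet:
assert np.allclose(y, x @ layer.W.T)
```

Because only A and B are trained, an adapter of rank r adds just r * (d_in + d_out) parameters per layer, which is what makes keeping one shared base model plus multiple small task adapters attractive on memory-constrained devices.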