AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
October 13, 2025
Authors: Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu
cs.AI
Abstract
In recent years, cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini,
and Claude Sonnet have demonstrated outstanding performance with enormous
model sizes reaching hundreds of billions of parameters, yet their
requirements far exceed the memory, power-consumption, and compute limits of
edge devices such as mobile phones. This paper introduces
AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on
Qwen3's LLM and various visual encoders. We comprehensively outline the model
architectures, training pipeline, and training data of AndesVL, which achieves
first-tier performance across a wide range of open-source benchmarks, including
fields such as text-rich image understanding, reasoning and math, multi-image
comprehension, general VQA, hallucination mitigation, multilingual
understanding, and GUI-related tasks when compared with state-of-the-art models
of a similar scale. Furthermore, we introduce a 1+N LoRA (Low-Rank Adaptation) strategy.
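The report does not detail the 1+N LoRA strategy in this abstract. As a reading aid only, the sketch below illustrates the general LoRA idea the name suggests: one frozen base weight ("1") shared across tasks, plus N small task-specific low-rank adapters that can be switched at inference time. All class and parameter names here are hypothetical, not from the paper.

```python
# Hedged sketch (NOT the paper's implementation): one frozen base weight plus
# N switchable low-rank adapters, the generic pattern a "1+N LoRA" name implies.
import numpy as np

class LoRALinear:
    """Frozen base weight W plus N switchable low-rank updates B_i @ A_i."""

    def __init__(self, d_in, d_out, rank=4, n_adapters=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))  # frozen base model (the "1")
        # N adapters: A_i maps d_in -> rank, B_i maps rank -> d_out.
        # B_i starts at zero, so an untrained adapter leaves the output unchanged.
        self.A = [rng.normal(size=(rank, d_in)) * 0.01 for _ in range(n_adapters)]
        self.B = [np.zeros((d_out, rank)) for _ in range(n_adapters)]
        self.active = None  # None = base model only; an int selects adapter i

    def forward(self, x):
        y = self.W @ x
        if self.active is not None:
            i = self.active
            y = y + self.B[i] @ (self.A[i] @ x)  # cheap low-rank delta
        return y

layer = LoRALinear(d_in=8, d_out=8)
x = np.ones(8)
base = layer.forward(x)
layer.active = 0  # switch on task-0 adapter; zero-init B keeps output identical
assert np.allclose(layer.forward(x), base)
```

Because only the small A/B matrices differ per task, storing N adapters alongside one base model is far cheaper than storing N full fine-tuned models, which is what makes such a scheme attractive on memory-constrained mobile devices.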