

HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices

December 16, 2025
Authors: HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang
cs.AI

Abstract

Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.
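The three mechanisms named in the abstract (tiling to bound peak memory, VRC-style resolution prediction, and switching between DCL-aligned multi-scale ViT branches) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the tile size, the complexity thresholds, and the resolution ladder are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of a HyperVL-style adaptive visual pipeline.
# All identifiers and constants below are illustrative assumptions,
# not the paper's actual API or hyperparameters.

from typing import List, Tuple

TILE = 448  # assumed tile side length in pixels


def tile_image(width: int, height: int, tile: int = TILE) -> List[Tuple[int, int]]:
    """Split an image into fixed-size tiles so that peak memory is bounded
    by one tile's activation footprint rather than the full image."""
    return [(x, y)
            for y in range(0, height, tile)
            for x in range(0, width, tile)]


def predict_resolution(complexity: float) -> int:
    """Stand-in for the Visual Resolution Compressor (VRC): map an
    estimated per-tile content complexity in [0, 1] to an encoding
    resolution, so low-detail tiles avoid redundant high-res compute."""
    if complexity < 0.3:
        return 224   # low-detail tile -> cheapest branch
    if complexity < 0.7:
        return 336   # medium-detail tile
    return 448       # high-detail tile -> full-resolution branch


def encode(tile_xy: Tuple[int, int], resolution: int) -> str:
    """Dynamic switch between multi-scale ViT branches that DCL has
    aligned to a shared embedding space; returns a branch label here."""
    return f"vit_{resolution}"


def visual_pipeline(width: int, height: int,
                    complexities: List[float]) -> List[str]:
    """Tile the image, pick a resolution per tile, encode each tile
    with the matching branch, all feeding one shared LLM."""
    tiles = tile_image(width, height)
    assert len(complexities) == len(tiles), "one complexity score per tile"
    return [encode(t, predict_resolution(c))
            for t, c in zip(tiles, complexities)]
```

In this toy setup, a mostly blank tile (complexity 0.1) would be routed to the 224-pixel branch while a dense text-heavy tile (complexity 0.9) takes the 448-pixel branch; the actual routing signal and branch granularity in HyperVL may differ.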