HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
December 16, 2025
Authors: HyperAI Team, Yuchen Liu, Kaiyang Han, Zhiqiang Xia, Yuhang Dong, Chen Song, Kangyu Tang, Jiaming Xu, Xiushi Feng, WenXuan Yu, Li Peng, Mingyang Wang, Kai Wang, Changpeng Yang, Yang Li, Haoyu Lu, Hao Wang, Bingna Xu, Guangyao Liu, Long Huang, Kaibin Guo, Jinyang Wu, Dan Wu, Hongzhen Wang, Peng Zhou, Shuai Nie, Shande Wang, Runyu Shi, Ying Huang
cs.AI
Abstract
Current multimodal large language models possess strong perceptual and reasoning capabilities; however, their high computational and memory requirements make them difficult to deploy directly in on-device environments. While small-parameter models are progressively endowed with strong general capabilities, standard Vision Transformer (ViT) encoders remain a critical bottleneck, suffering from excessive latency and memory consumption when processing high-resolution inputs. To address these challenges, we introduce HyperVL, an efficient multimodal large language model tailored for on-device inference. HyperVL adopts an image-tiling strategy to cap peak memory usage and incorporates two novel techniques: (1) a Visual Resolution Compressor (VRC) that adaptively predicts optimal encoding resolutions to eliminate redundant computation, and (2) Dual Consistency Learning (DCL), which aligns multi-scale ViT encoders within a unified framework, enabling dynamic switching between visual branches under a shared LLM. Extensive experiments demonstrate that HyperVL achieves state-of-the-art performance among models of comparable size across multiple benchmarks. Furthermore, it significantly reduces latency and power consumption on real mobile devices, demonstrating its practicality for on-device multimodal inference.