

AutoNeural: Co-Designing Vision-Language Models for NPU Inference

December 2, 2025
Authors: Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang
cs.AI

Abstract

While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone built on depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers a 3x decoding speedup and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
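The abstract names two architectural ingredients: a MobileNet-style depthwise-separable vision backbone whose bounded activations quantize stably, and a gated-convolution token mixer that replaces attention's growing KV cache with a fixed-size state. The PyTorch sketch below illustrates both ideas under stated assumptions; the module names (`DepthwiseSeparableBlock`, `GatedConvMixer`), the layer sizes, and the ReLU6 activation are illustrative choices in the spirit of MobileNet and SSM-style mixers, not the paper's released implementation.

```python
# Minimal sketch of two NPU-friendly blocks described in the abstract.
# All names, sizes, and the ReLU6 choice are illustrative assumptions.
import torch
import torch.nn as nn


class DepthwiseSeparableBlock(nn.Module):
    """MobileNet-style vision block: depthwise conv + pointwise conv.

    ReLU6 clamps activations to [0, 6], keeping their distribution
    bounded, which makes fixed-point (INT4/8/16) quantization more
    stable than the unbounded activations typical of ViT encoders.
    """

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))


class GatedConvMixer(nn.Module):
    """SSM-inspired token mixer: a causal depthwise conv modulated by a
    sigmoid gate. Cost is linear in sequence length, and the fixed-size
    convolution state stands in for the growing KV cache of attention
    during autoregressive decoding.
    """

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)  # causal via left pad
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        u = self.conv(u.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        return self.out_proj(u * torch.sigmoid(gate))


if __name__ == "__main__":
    img = torch.randn(1, 32, 56, 56)
    print(DepthwiseSeparableBlock(32, 64)(img).shape)  # (1, 64, 56, 56)
    tokens = torch.randn(1, 128, 256)
    print(GatedConvMixer(256)(tokens).shape)           # (1, 128, 256)
```

The gated mixer's memory footprint during decoding is constant (the last `kernel_size - 1` inputs per channel), which is the property the abstract credits for eliminating KV-cache I/O; any hardware-aware quantization of these blocks is left out of the sketch.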