Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
March 6, 2026
Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
cs.AI
Abstract
Vision-language model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs and challenge the prevailing assumption that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse, category-level invariances that suppress the fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments show that Penguin-Encoder is a superior alternative to traditional contrastive pretraining, unlocking higher visual fidelity and greater data efficiency for multimodal understanding. Across a range of image and video benchmarks, Penguin-VL matches leading VLMs (e.g., Qwen3-VL) on mathematical reasoning and surpasses them on tasks such as document understanding, visual knowledge QA, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representations, rather than model scaling, are the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving the fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
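To make the core idea concrete, below is a minimal PyTorch sketch, not the released implementation, of what initializing a vision encoder from a text-only LLM can look like: the LLM's pretrained transformer blocks are reused as-is, and only the word-embedding table is swapped for a patch-embedding projection so that image patches play the role of tokens. The class name, patch size, and toy dimensions are all illustrative assumptions rather than details from the paper.

```python
# Sketch of an LLM-initialized vision encoder: reuse a text-only LLM's
# transformer blocks, but replace its token-embedding layer with a patch
# embedding so the blocks consume image patches instead of word tokens.
# No contrastive (CLIP/SigLIP) pretraining is involved.
import torch
import torch.nn as nn


class LLMInitializedVisionEncoder(nn.Module):
    def __init__(self, llm_blocks: nn.ModuleList, hidden_dim: int,
                 patch_size: int = 14, in_channels: int = 3):
        super().__init__()
        # Patch embedding stands in for the LLM's word-embedding table:
        # each image patch becomes one "token" in the LLM's hidden space.
        self.patch_embed = nn.Conv2d(in_channels, hidden_dim,
                                     kernel_size=patch_size,
                                     stride=patch_size)
        # Transformer blocks carry the pretrained text-LLM weights.
        self.blocks = llm_blocks
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> patch tokens: (B, N, hidden_dim)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)
        for block in self.blocks:
            x = block(x)
        # Dense per-patch features handed to the VLM's language model.
        return self.norm(x)


# Toy usage with stand-in blocks; in practice these weights would be
# copied from a text-only LLM checkpoint.
dim = 512
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    for _ in range(4)
])
encoder = LLMInitializedVisionEncoder(blocks, hidden_dim=dim)
features = encoder(torch.randn(1, 3, 224, 224))
print(features.shape)  # (1, 256, 512): 256 patches of 14x14 pixels
```

Under this framing, only the input interface changes while the pretrained weights inside the blocks carry over, which is why such an encoder can act as a drop-in replacement for a contrastive-pretrained tower in an otherwise unchanged VLM pipeline.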