Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
March 6, 2026
Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
cs.AI
Abstract
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B parameter) VLMs. We challenge the prevailing assumption that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, which optimizes for discrimination, enforces coarse, category-level invariances that suppress the fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this, we present Penguin-VL, whose vision encoder is initialized directly from a text-only LLM. Our experiments show that the Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking higher visual fidelity and data efficiency for multimodal understanding. Across a range of image and video benchmarks, Penguin-VL matches leading VLMs (e.g., Qwen3-VL) on mathematical reasoning and surpasses them on tasks such as document understanding, visual knowledge QA, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation, rather than model scaling, is the primary driver of performance. Our ablations show that the Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving the fine-grained spatial and temporal cues critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
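To make the architectural idea concrete, below is a minimal, illustrative sketch of a vision encoder whose transformer trunk comes from a text-only LLM rather than a contrastively pretrained tower. This is not the authors' released implementation (see the repository above); the names and hyperparameters (`PatchEmbed`, `LLMInitVisionEncoder`, hidden size 2048, four layers) are assumptions for demonstration, and the stand-in layers would in practice be replaced by copying the pretrained LLM's layer weights.

```python
# Illustrative sketch only: a vision encoder built from LLM transformer blocks.
# All module names and sizes are assumptions, not the Penguin-VL codebase.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Turn an image into a sequence of patch tokens the LLM blocks can consume."""
    def __init__(self, patch=14, in_ch=3, dim=2048):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.proj(x)                     # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N, dim)

class LLMInitVisionEncoder(nn.Module):
    """Vision encoder whose trunk is initialized from a text-only LLM.

    `llm_blocks` is assumed to be the stack of pretrained layers taken from an
    LLM; here we use randomly initialized TransformerEncoderLayers as a
    stand-in so the sketch runs without downloading any weights.
    """
    def __init__(self, llm_blocks, dim=2048):
        super().__init__()
        self.patch_embed = PatchEmbed(dim=dim)
        self.blocks = llm_blocks             # pretrained LLM layers, adapted to vision
        self.norm = nn.LayerNorm(dim)

    def forward(self, images):
        tokens = self.patch_embed(images)
        for blk in self.blocks:
            tokens = blk(tokens)
        return self.norm(tokens)             # dense patch features for the VLM decoder

# Stand-in for pretrained LLM layers (in practice, copy the LLM's weights).
dim = 2048
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=dim, nhead=16, batch_first=True)
    for _ in range(4)
])
encoder = LLMInitVisionEncoder(blocks, dim=dim)
feats = encoder(torch.randn(1, 3, 224, 224))
print(feats.shape)  # torch.Size([1, 256, 2048])
```

In a conventional pipeline, this encoder would instead be pretrained with an image-text contrastive objective (e.g., CLIP-style InfoNCE), which is exactly the discrimination-focused objective the abstract argues suppresses fine-grained cues; in the sketch above, the trunk inherits the LLM's representations and is adapted to patch tokens during multimodal training.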