Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Abstract

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
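
For readers unfamiliar with the contrastive objective the abstract refers to, the following is a minimal sketch of a CLIP-style symmetric InfoNCE loss in plain Python. It is illustrative only and is not taken from the Penguin-VL codebase: the function names and the temperature value of 0.07 are assumptions made for this example, and SigLIP uses a pairwise sigmoid loss rather than this softmax form. The loss depends only on whether each image embedding is closer to its own caption than to the other captions in the batch, which is the discrimination-oriented behavior the abstract contrasts with fine-grained perception.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def _normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings.

    The temperature default is an assumption for illustration.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: row i should concentrate on its paired text i.
    i2t = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: column j should concentrate on its paired image j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
    mismatched = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
    print(f"matched pairs: {matched:.6f}, mismatched pairs: {mismatched:.6f}")
```

Correctly paired embeddings drive the loss toward zero, while swapped pairs are penalized heavily. Nothing in the objective rewards retaining detail beyond what is needed to tell the batch items apart.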
