Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Abstract

Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse, category-level invariances that suppress the fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation, rather than model scaling, is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
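To make the objective mismatch concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss in plain Python. It is not taken from the Penguin-VL code. The function names, the toy embeddings, and the fixed `temperature=0.07` are illustrative assumptions, and SigLIP uses a sigmoid-based variant rather than this softmax form. The sketch shows that the loss sees only one pooled vector per image and rewards telling each image apart from the other captions in the batch.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style symmetric InfoNCE over paired, pooled embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: the matching caption (diagonal) is the target.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: two images and two captions, each pooled to a 2-d vector.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Nothing in this objective supervises per-patch or per-frame detail directly. Any image embedding that separates the batch's captions scores well, which is one way to read the abstract's claim that discrimination-oriented training favors coarse invariances.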
