DeepSeek-VL: Towards Real-World Vision-Language Understanding

March 8, 2024
Authors: Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan
cs.AI

Abstract

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: we strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. Fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.

We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities.

The DeepSeek-VL family (both 1.3B and 7B models) showcases a superior user experience as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both the 1.3B and 7B models publicly accessible to foster innovation based on this foundation model.
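The hybrid vision encoder is the abstract's most concrete architectural claim. The sketch below is a minimal, hypothetical illustration of the general pattern, not DeepSeek-VL's actual implementation: a low-resolution branch supplies coarse semantics, a high-resolution branch supplies fine detail, and the two feature maps are fused into one sequence of visual tokens for the language model. The convolutional stems are stand-ins for the pretrained vision backbones such a model would use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionEncoder(nn.Module):
    """Illustrative hybrid encoder: a low-res semantic branch plus a
    high-res detail branch, fused into a short visual-token sequence.
    Hypothetical sketch; layer choices are assumptions, not the paper's.
    """

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        # Low-res branch: coarse semantics from a downsampled 384x384 view
        # (stand-in for a pretrained ViT-style encoder).
        self.semantic_branch = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # High-res branch: fine detail from the full 1024x1024 image
        # (stand-in for a high-resolution detail encoder).
        self.detail_branch = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),      # 1024 -> 64x64
            nn.Conv2d(embed_dim, embed_dim, kernel_size=2, stride=2) # 64x64 -> 32x32
        )
        # Project the concatenated features to the LLM embedding width.
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, 1024, 1024)
        low_res = F.interpolate(image, size=(384, 384),
                                mode="bilinear", align_corners=False)
        sem = self.semantic_branch(low_res)   # (B, D, 24, 24)
        det = self.detail_branch(image)       # (B, D, 32, 32)
        # Resample the detail map onto the semantic grid so tokens align.
        det = F.interpolate(det, size=sem.shape[-2:],
                            mode="bilinear", align_corners=False)
        tokens = torch.cat([sem, det], dim=1)        # (B, 2D, 24, 24)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, 576, 2D)
        return self.fuse(tokens)                     # (B, 576, D)

if __name__ == "__main__":
    encoder = HybridVisionEncoder()
    img = torch.randn(1, 3, 1024, 1024)
    print(encoder(img).shape)  # torch.Size([1, 576, 1024])
```

The design point the abstract highlights is the cost trade-off: only the detail branch ever sees the full 1024 x 1024 input, and the fused grid keeps the token sequence handed to the LLM short even at high resolution.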
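The "competitive dynamics" between modalities during joint pretraining is, in practice, a data-scheduling problem: if multimodal batches crowd out text-only batches too early, language ability erodes. Below is a hedged sketch of one plausible way such a mixing ratio could be scheduled; the abstract does not give the paper's actual recipe, so every number and function here is an assumption.

```python
import random

def modality_schedule(step: int, total_steps: int,
                      start_mm_ratio: float = 0.3,
                      final_mm_ratio: float = 0.7) -> float:
    """Fraction of multimodal (image-text) batches at a given step.

    Hypothetical linear ramp: begin with mostly text-only batches to
    protect LLM capabilities, then raise the multimodal share. The
    actual DeepSeek-VL ratios and schedule may differ.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return start_mm_ratio + (final_mm_ratio - start_mm_ratio) * frac

def sample_batch_type(step: int, total_steps: int) -> str:
    """Pick 'multimodal' or 'text' for this step according to the schedule."""
    if random.random() < modality_schedule(step, total_steps):
        return "multimodal"
    return "text"

if __name__ == "__main__":
    counts = {"multimodal": 0, "text": 0}
    for step in range(10_000):
        counts[sample_batch_type(step, 10_000)] += 1
    print(counts)  # roughly half multimodal over the whole run
```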