

DeepSeek-VL: Towards Real-World Vision-Language Understanding

March 8, 2024
Authors: Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan
cs.AI

Abstract

We present DeepSeek-VL, an open-source Vision-Language (VL) model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions. First, we strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios, including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly; fine-tuning with this dataset substantially improves the model's user experience in practical applications. Second, considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 × 1024) while maintaining relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. Third, we posit that a proficient vision-language model should, foremost, possess strong language abilities. To preserve LLM capabilities during pretraining, we investigate an effective VL pretraining strategy that integrates LLM training from the beginning and carefully manages the competitive dynamics observed between the vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) delivers a superior user experience as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of vision-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both the 1.3B and 7B models publicly accessible to foster innovation based on this foundation model.
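To make the hybrid-encoder idea concrete, here is a minimal sketch: a low-resolution branch captures coarse semantics cheaply, a high-resolution branch preserves fine detail from the full 1024 × 1024 input, and their token features are fused and projected into the language model's embedding space. The module names, the 384 × 384 low-resolution size, and the concatenation-based fusion are illustrative assumptions, not a statement of DeepSeek-VL's exact architecture.

```python
# Minimal sketch of a hybrid vision encoder (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionEncoder(nn.Module):
    """Fuses a low-res semantic encoder with a high-res detail encoder.

    Both branches are assumed to emit aligned token grids of the same
    length N, so their features can be concatenated token-wise.
    """

    def __init__(self, semantic_encoder: nn.Module, detail_encoder: nn.Module,
                 sem_dim: int, det_dim: int, llm_dim: int):
        super().__init__()
        self.semantic_encoder = semantic_encoder  # e.g. a ViT run at 384x384
        self.detail_encoder = detail_encoder      # e.g. a backbone run at 1024x1024
        # Project the concatenated features into the LLM embedding space.
        self.projector = nn.Linear(sem_dim + det_dim, llm_dim)

    def forward(self, image_1024: torch.Tensor) -> torch.Tensor:
        # The low-res branch sees a downsampled copy, keeping its cost fixed.
        image_384 = F.interpolate(image_1024, size=(384, 384),
                                  mode="bilinear", align_corners=False)
        sem = self.semantic_encoder(image_384)   # (B, N, sem_dim)
        det = self.detail_encoder(image_1024)    # (B, N, det_dim)
        fused = torch.cat([sem, det], dim=-1)    # token-wise feature fusion
        return self.projector(fused)             # (B, N, llm_dim) visual tokens
```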
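The pretraining strategy's core point, preventing language ability from degrading as vision-language data is introduced, can be sketched as a sampler that retains a fixed share of text-only batches throughout training. The 70/30 split below is a placeholder for illustration, not the ratio or schedule reported in the paper.

```python
import random

def sample_batch(text_loader, multimodal_loader, text_ratio=0.7):
    """Draw a pretraining batch, keeping text-only data in the mix.

    text_ratio is a placeholder; DeepSeek-VL's actual mixing schedule
    (and whether it changes over training) is specified in the paper.
    """
    if random.random() < text_ratio:
        return next(text_loader), "text"           # preserves LLM capability
    return next(multimodal_loader), "vision+text"  # teaches visual grounding
```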
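Because both checkpoints are openly released, a minimal loading sketch looks like the following, assuming the Hugging Face Hub repo id deepseek-ai/deepseek-vl-7b-chat; full chat inference additionally relies on the authors' companion deepseek_vl processing package, which is not shown here.

```python
import torch
from transformers import AutoModelForCausalLM

# Repo id assumed from the public release; trust_remote_code fetches the
# model's custom multimodal classes from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-vl-7b-chat",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval()
```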