DeepSeek-OCR: Contexts Optical Compression
October 21, 2025
Authors: Haoran Wei, Yaofeng Sun, Yukun Li
cs.AI
Abstract
We present DeepSeek-OCR as an initial investigation into the feasibility of
compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two
components: DeepEncoder as the encoder and DeepSeek3B-MoE-A570M as the decoder. Specifically,
DeepEncoder serves as the core engine, designed to maintain low activations
under high-resolution input while achieving high compression ratios to ensure
an optimal and manageable number of vision tokens. Experiments show that when
the number of text tokens is within 10 times that of vision tokens (i.e., a
compression ratio < 10x), the model can achieve decoding (OCR) precision of
97%. Even at a compression ratio of 20x, OCR accuracy remains at
about 60%. This shows considerable promise for research areas such as
historical long-context compression and memory forgetting mechanisms in LLMs.
Beyond this, DeepSeek-OCR also demonstrates high practical value. On
OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision
tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while
utilizing fewer than 800 vision tokens per page on average. In production,
DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages
per day on a single A100-40G GPU. Code and model weights are publicly accessible at
http://github.com/deepseek-ai/DeepSeek-OCR.
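
To make the compression arithmetic above concrete, here is a minimal Python sketch (not from the released codebase; the function names are hypothetical). It computes the text-to-vision token ratio the abstract describes and maps it onto the reported accuracy regimes (<10x → ~97% precision, 20x → ~60% accuracy).

```python
def compression_ratio(n_text_tokens: int, n_vision_tokens: int) -> float:
    """Text-to-vision token ratio for one page, as defined in the abstract."""
    return n_text_tokens / n_vision_tokens


def reported_accuracy_regime(ratio: float) -> str:
    """Map a compression ratio onto the operating points the paper reports.

    These thresholds restate the abstract's figures; intermediate
    behavior is not specified there.
    """
    if ratio < 10:
        return "~97% OCR decoding precision (reported for <10x)"
    if ratio <= 20:
        return "degrading toward ~60% OCR accuracy (reported at 20x)"
    return "beyond the reported operating points"


# Example: a page that decodes into 950 text tokens but is represented
# with only 100 vision tokens has a 9.5x compression ratio, inside the
# high-fidelity (<10x) regime.
ratio = compression_ratio(n_text_tokens=950, n_vision_tokens=100)
print(f"{ratio:.1f}x -> {reported_accuracy_regime(ratio)}")
```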