DeepSeek-OCR 2: Visual Causal Flow
January 28, 2026
Authors: Haoran Wei, Yaofeng Sun, Yukun Li
cs.AI
Abstract
We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when feeding them into LLMs. This, however, contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by the inherent logical structure of the content. Particularly for images with complex layouts, human vision exhibits causally informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens before LLM-based content interpretation. This work explores a new paradigm: whether 2D image understanding can be achieved effectively through two cascaded 1D causal reasoning stages, offering an architectural approach with the potential to achieve genuine 2D reasoning. Code and model weights are publicly available at http://github.com/deepseek-ai/DeepSeek-OCR-2.
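To make the idea concrete, below is a minimal, illustrative sketch of semantics-driven token reordering, assuming a PyTorch-style implementation. The class name CausalReorderingEncoder, the priority head, and the hard argsort-based reordering are hypothetical simplifications and are not taken from the released DeepSeek-OCR 2 code; they only illustrate a causal pass over raster-ordered visual tokens that produces a new reading order before the sequence is handed to the LLM.

# Minimal sketch (not the released implementation): a causal encoder that
# scores flattened patch tokens and reorders them before they reach the LLM.
# All module and parameter names here are hypothetical.
import torch
import torch.nn as nn


class CausalReorderingEncoder(nn.Module):
    """Toy illustration of semantics-driven visual-token reordering.

    A causal transformer pass reads the raster-ordered visual tokens,
    predicts a scalar "reading priority" per token, and the tokens are
    re-sorted by that priority before being handed to the LLM decoder.
    """

    def __init__(self, dim: int = 256, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.priority_head = nn.Linear(dim, 1)  # one reading-order score per token

    def forward(self, visual_tokens: torch.Tensor):
        # visual_tokens: (batch, num_tokens, dim), in raster-scan order.
        n = visual_tokens.size(1)
        # Causal mask so each token's score depends only on tokens already seen,
        # mimicking a 1D causal "reading" pass over the 2D layout.
        causal_mask = torch.triu(
            torch.full((n, n), float("-inf"), device=visual_tokens.device),
            diagonal=1,
        )
        hidden = self.encoder(visual_tokens, mask=causal_mask)
        scores = self.priority_head(hidden).squeeze(-1)  # (batch, num_tokens)
        order = scores.argsort(dim=-1)                   # predicted reading order
        reordered = torch.gather(
            visual_tokens, 1, order.unsqueeze(-1).expand_as(visual_tokens)
        )
        # `reordered` would then be projected and consumed by the LLM decoder,
        # which performs the second 1D causal pass over the new sequence.
        return reordered, order


if __name__ == "__main__":
    tokens = torch.randn(1, 64, 256)  # e.g. an 8x8 grid of patch embeddings
    encoder = CausalReorderingEncoder()
    reordered, order = encoder(tokens)
    print(reordered.shape, order[0, :8])

In a real system the hard argsort would presumably be replaced by a differentiable or otherwise learnable ordering mechanism; the sketch keeps it simple to highlight the two cascaded 1D causal passes the abstract describes (encoder-side reordering, then LLM decoding).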