DeepSeek-OCR 2: Visual Causal Flow
January 28, 2026
Authors: Haoran Wei, Yaofeng Sun, Yukun Li
cs.AI
Abstract
We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, that can dynamically reorder visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably feed visual tokens into the LLM in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding. This contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by the inherent logical structure of the content. For images with complex layouts in particular, human vision exhibits causally informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to reorder visual tokens before LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Code and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.
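To make the contrast in the abstract concrete, the toy sketch below compares a fixed raster-scan token order with a hand-written reading order for a two-column page, and builds the lower-triangular mask that defines a 1D causal sequence. It is an illustration under stated assumptions only: the function names, the grid size, and the rule-based two-column order are all hypothetical, and none of this is DeepEncoder V2's learned reordering mechanism, which the abstract does not specify.

```python
# Toy illustration only. The two-column rule below is a hypothetical,
# hand-written stand-in for a semantic reading order; it is NOT the
# learned reordering performed by DeepEncoder V2.

def raster_order(rows, cols):
    """Patch coordinates in fixed raster-scan order (top-left to bottom-right)."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def two_column_reading_order(rows, cols):
    """Read the left half of the grid top to bottom, then the right half."""
    split = cols // 2
    left = [(r, c) for r in range(rows) for c in range(split)]
    right = [(r, c) for r in range(rows) for c in range(split, cols)]
    return left + right


def reorder_permutation(rows, cols):
    """For each position in reading order, the index of that patch in raster order."""
    return [r * cols + c for (r, c) in two_column_reading_order(rows, cols)]


def causal_mask(n):
    """Lower-triangular mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    rows, cols = 2, 4
    print("raster order:    ", raster_order(rows, cols))
    print("reading order:   ", two_column_reading_order(rows, cols))
    print("permutation:     ", reorder_permutation(rows, cols))
    print("causal mask (3): ", causal_mask(3))
```

On a 2x4 grid the raster order interleaves the two columns row by row, while the reading order finishes the left column before starting the right one. A causal sequence model consuming the tokens therefore sees related content contiguously only in the second case.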