Unlimited OCR Processing

Abstract

Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have once again put OCR in the spotlight. A widely held view is that using a large language model (LLM) as the decoder lets the model draw on the prior distribution of language, which improves OCR performance. The downside is equally evident: as the output sequence grows, the accumulated KV cache drives up memory consumption and progressively slows generation. This stands in stark contrast to humans, whose efficiency does not decline during long copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate the working memory humans rely on when parsing. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces the cost of attention computation while keeping the KV cache constant throughout the entire decoding process. By combining the high compression rate of DeepSeek OCR's encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of document pages in a single forward pass within a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism: beyond OCR, it is equally applicable to tasks such as speech recognition (ASR) and translation. Code and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.
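To make the constant-KV-cache claim concrete, the toy sketch below contrasts a conventional cache, which grows by one entry per generated token, with a bounded cache that keeps a fixed set of reference entries plus a sliding window of recent outputs. This is only one plausible reading of R-SWA based on its name; the abstract does not specify the mechanism, so the class `ReferenceWindowCache`, the function `simulate`, and the parameters `num_reference` and `window` are all hypothetical and are not taken from the released code.

```python
from collections import deque


class ReferenceWindowCache:
    """Toy cache: fixed reference entries plus a bounded window of recent outputs.

    Assumption: "reference" stands for tokens that stay visible for the whole
    decode (for example, encoder outputs). The abstract does not confirm this.
    """

    def __init__(self, reference_entries, window):
        # Reference entries are stored once and never evicted.
        self.reference = list(reference_entries)
        # deque(maxlen=...) drops the oldest entry once the window is full.
        self.recent = deque(maxlen=window)

    def append(self, entry):
        """Record the cache entry for a newly generated token."""
        self.recent.append(entry)

    def visible(self):
        """Entries the next decoding step would attend to."""
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


def simulate(num_reference, window, steps):
    """Return per-step sizes of a full cache and of the bounded cache."""
    cache = ReferenceWindowCache(
        [f"ref{i}" for i in range(num_reference)], window
    )
    full_sizes, bounded_sizes = [], []
    for t in range(steps):
        cache.append(f"out{t}")
        # A conventional cache keeps every entry ever produced.
        full_sizes.append(num_reference + t + 1)
        bounded_sizes.append(len(cache))
    return full_sizes, bounded_sizes, cache.visible()


if __name__ == "__main__":
    full, bounded, visible = simulate(num_reference=4, window=3, steps=10)
    print("full cache sizes:   ", full)
    print("bounded cache sizes:", bounded)
    print("visible at last step:", visible)
```

In this sketch the bounded cache never exceeds `num_reference + window` entries, so per-step memory and attention cost stop growing once the window fills, whereas the full cache grows linearly with the output length.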