Unlimited OCR

Abstract

End-to-end OCR models, exemplified by DeepSeek OCR, have recently brought OCR back into the spotlight. A widely held view is that using a large language model (LLM) as the decoder lets the model exploit language priors, which improves OCR performance. The downside is equally evident: as the output sequence grows, the accumulated KV cache increases memory consumption and progressively slows generation. Humans, by contrast, show no such loss of efficiency during long copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate the working memory humans use when parsing. Taking DeepSeek OCR as the baseline, we replace every attention layer in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces the cost of attention computation and keeps the KV cache at a constant size throughout decoding. By combining the high compression rate of DeepSeek OCR's encoder with this constant KV cache design, Unlimited OCR can transcribe dozens of document pages in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism: beyond OCR, it is equally applicable to tasks such as ASR and translation. Code and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.
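
The abstract does not spell out how R-SWA works, so the following is only a toy sketch of the constant-KV-cache idea, not the report's implementation. It assumes (our assumption, not stated in the text) that a fixed set of "reference" entries, such as the encoder tokens, stays in the cache for the whole decode, while generated tokens are kept only within a sliding window. The names `BoundedKVCache`, `simulate`, and `window` are hypothetical.

```python
from collections import deque


class BoundedKVCache:
    """Toy KV cache: fixed reference entries plus a sliding window of recent ones.

    Assumption (not from the report): reference entries are never evicted,
    and generated entries older than `window` steps are dropped.
    """

    def __init__(self, reference_entries, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.reference = list(reference_entries)  # kept for the whole decode
        self.recent = deque(maxlen=window)        # oldest entry is evicted automatically

    def append(self, entry):
        """Store the key/value entry produced by one decoding step."""
        self.recent.append(entry)

    def visible(self):
        """Entries the next decoding step would be allowed to attend to."""
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


def simulate(num_reference, window, steps):
    """Return the cache size after each of `steps` decoding steps."""
    cache = BoundedKVCache([("ref", i) for i in range(num_reference)], window)
    sizes = []
    for t in range(steps):
        cache.append(("out", t))
        sizes.append(len(cache))
    return sizes
```

With 4 reference entries and a window of 3, `simulate(4, 3, 6)` returns `[5, 6, 7, 7, 7, 7]`: the cache stops growing once the window is full, whereas a full-history cache would gain one entry at every step.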