Unlimited OCR Works

Abstract

End-to-end OCR models, exemplified by DeepSeek OCR, have recently put OCR back in the spotlight. A widely held view is that using a large language model (LLM) as the decoder lets the model exploit the prior distribution of language, which improves OCR performance. The downside is equally clear: as the output sequence grows, the accumulated KV cache drives up memory consumption and progressively slows generation. This stands in stark contrast to humans, who show no such loss of efficiency when copying long texts. In this technical report, we propose Unlimited OCR, a model designed to emulate the working memory humans rely on when parsing. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces the cost of attention computation while keeping the KV cache constant throughout decoding. By combining the high compression rate of DeepSeek OCR's encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of document pages in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism: beyond OCR, it is equally applicable to tasks such as ASR and translation. Code and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.
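To illustrate why a bounded cache keeps memory flat during long decoding, the following is a minimal sketch, not the R-SWA implementation from the report. It assumes that "reference" means a fixed set of key/value entries (for example, the encoder's visual tokens) that every decoding step attends to, alongside a sliding window of the most recent decoder entries. The names `BoundedKVCache` and `cache_sizes` are hypothetical and exist only for this example.

```python
import math
from collections import deque


class BoundedKVCache:
    """Fixed reference entries plus a sliding window of recent decoder entries.

    Assumption for illustration: the reference set is non-empty and never changes.
    """

    def __init__(self, reference, window):
        # reference: list of (key, value) pairs kept for the whole decode.
        self.reference = list(reference)
        # deque(maxlen=...) evicts the oldest entry once the window is full.
        self.recent = deque(maxlen=window)

    def append(self, key, value):
        self.recent.append((key, value))

    def __len__(self):
        return len(self.reference) + len(self.recent)

    def attend(self, query):
        """Single-head scaled dot-product attention over the cached entries."""
        entries = self.reference + list(self.recent)
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key, _ in entries]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(entries[0][1])
        return [
            sum(w * value[i] for w, (_, value) in zip(weights, entries)) / total
            for i in range(dim)
        ]


def cache_sizes(steps, num_reference, window):
    """Return (bounded, full) cache lengths after each decoding step."""
    reference = [([1.0, 0.0], [1.0, 0.0])] * num_reference
    cache = BoundedKVCache(reference, window)
    sizes = []
    for step in range(steps):
        cache.append([0.0, 1.0], [float(step), 0.0])
        # A full cache would hold the reference plus every token decoded so far.
        sizes.append((len(cache), num_reference + step + 1))
    return sizes
```

With 4 reference entries and a window of 3, `cache_sizes(6, 4, 3)` shows the bounded cache levelling off at 7 entries while a full cache reaches 10 and keeps growing with every further token. In this sketch the cache still grows until the window fills; how the actual model sizes its window and selects its reference tokens is not specified in the abstract.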