Unlimited OCR Works

Abstract

Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have brought OCR back into the spotlight. A widely held view is that using a large language model (LLM) as the decoder lets the model exploit the prior distribution of language, which improves OCR performance. The downside is equally clear: as the output sequence grows, the accumulated KV cache increases memory consumption and progressively slows generation. This stands in contrast to humans, who show no such loss of efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate human parsing working memory. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces attention computation cost and keeps the KV cache constant throughout decoding. By combining the high compression rate of DeepSeek OCR's encoder with this constant KV cache design, Unlimited OCR can transcribe dozens of document pages in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism: beyond OCR, it is equally applicable to tasks such as ASR and translation. Code and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.
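
To make the constant-cache claim concrete, the following is a minimal sketch and not the released implementation. It assumes, as one plausible reading of the name, that R-SWA lets every generated token attend to all reference tokens (for example the encoder's visual tokens) plus a fixed window of the most recent generated tokens; the abstract does not spell out these details. The function names and the sizes `n_ref=256` and `window=128` are hypothetical.

```python
def full_cache_len(n_ref: int, n_generated: int) -> int:
    """Cached positions under standard causal attention: everything seen so far."""
    return n_ref + n_generated


def windowed_cache_len(n_ref: int, n_generated: int, window: int) -> int:
    """Cached positions under the ASSUMED scheme: all reference tokens are kept,
    but only the last `window` generated tokens are retained."""
    return n_ref + min(n_generated, window)


def visible_positions(n_ref: int, step: int, window: int) -> list[int]:
    """Absolute indices that the query at generated step `step` (0-based) may
    attend to under the assumed scheme: the reference span, then the window."""
    reference = list(range(n_ref))
    start = max(0, step - window + 1)
    recent = [n_ref + i for i in range(start, step + 1)]
    return reference + recent


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    n_ref, window = 256, 128
    for n_generated in (100, 1_000, 10_000):
        print(
            n_generated,
            full_cache_len(n_ref, n_generated),
            windowed_cache_len(n_ref, n_generated, window),
        )
```

Under these assumptions the full cache grows linearly with the number of generated tokens, while the windowed cache stops growing once `window` tokens have been produced, which is the behavior the abstract describes as a constant KV cache.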