General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
September 3, 2024
Authors: Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang
cs.AI
Abstract
Traditional OCR systems (OCR-1.0) are increasingly unable to meet users' needs
as demand grows for intelligent processing of man-made optical characters. In
this paper, we collectively refer to all artificial optical signals (e.g.,
plain text, math/molecular formulas, tables, charts, sheet music, and even
geometric shapes) as "characters" and propose the General OCR Theory along
with an excellent model, namely GOT, to promote the arrival of OCR-2.0. GOT,
with 580M parameters, is a unified, elegant, end-to-end model consisting of a
high-compression encoder and a long-context decoder. As an OCR-2.0 model, GOT
can handle all of the above "characters" under various OCR tasks. On the input
side, the model supports commonly used scene- and document-style images in
slice and whole-page styles. On the output side, GOT can generate plain or
formatted results (markdown/tikz/smiles/kern) via a simple prompt. In
addition, the model offers interactive OCR features, i.e., region-level
recognition guided by coordinates or colors. We also adapt dynamic-resolution
and multi-page OCR technologies to GOT for better practicality. In
experiments, we provide sufficient results to demonstrate the superiority of
our model.