PubMed-OCR: PMC Open Access OCR Annotations
January 16, 2026
Authors: Hunter Heidenreich, Yosheb Getachew, Olivia Dinica, Ben Elliott
cs.AI
Abstract
PubMed-OCR is an OCR-centric corpus of scientific articles derived from PubMed Central Open Access PDFs. Each page image is annotated with Google Cloud Vision and released in a compact JSON schema with word-, line-, and paragraph-level bounding boxes. The corpus spans 209.5K articles (1.5M pages; ~1.3B words) and supports layout-aware modeling, coordinate-grounded QA, and evaluation of OCR-dependent pipelines. We analyze corpus characteristics (e.g., journal coverage and detected layout features) and discuss limitations, including reliance on a single OCR engine and heuristic line reconstruction. We release the data and schema to facilitate downstream research and invite extensions.
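The abstract mentions heuristic line reconstruction from OCR output but does not describe the heuristic or the JSON field names. As a rough illustration only, the following sketch shows one common way to group word-level bounding boxes into lines by vertical overlap. The `Word` fields, the function names, and the `min_overlap` threshold are all hypothetical and are not taken from the released schema or from the authors' method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical word record: text plus an axis-aligned box in page pixels.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_words_into_lines(words: List[Word], min_overlap: float = 0.5) -> List[List[Word]]:
    """Greedily assign each word to the first line whose vertical extent
    overlaps at least `min_overlap` of the word's height."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w.y0, w.x0)):
        height = w.y1 - w.y0
        placed = False
        for line in lines:
            top = min(v.y0 for v in line)
            bottom = max(v.y1 for v in line)
            overlap = min(bottom, w.y1) - max(top, w.y0)
            if height > 0 and overlap / height >= min_overlap:
                line.append(w)
                placed = True
                break
        if not placed:
            lines.append([w])
    # Order words left to right within a line, and lines top to bottom.
    for line in lines:
        line.sort(key=lambda w: w.x0)
    lines.sort(key=lambda line: min(v.y0 for v in line))
    return lines


def line_text(line: List[Word]) -> str:
    # Join the words of one reconstructed line with single spaces.
    return " ".join(w.text for w in line)
```

A heuristic of this kind is simple but can merge words across columns or split lines containing superscripts, which is one reason heuristic line reconstruction is worth listing as a limitation.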