Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
July 7, 2025
Authors: Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
cs.AI
Abstract
Historical documents represent an invaluable cultural heritage, yet have
undergone significant degradation over time through tears, water erosion, and
oxidation. Existing Historical Document Restoration (HDR) methods primarily
focus on a single modality or limited-size restoration, failing to meet practical
needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel
automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and
6,543 synthetic images with character-level and line-level locations, as well
as character annotations in different damage grades. AutoHDR mimics historians'
restoration workflows through a three-stage approach: OCR-assisted damage
localization, vision-language context text prediction, and patch autoregressive
appearance restoration. The modular architecture of AutoHDR enables seamless
human-machine collaboration, allowing for flexible intervention and
optimization at each restoration stage. Experiments demonstrate AutoHDR's
remarkable performance in HDR. When processing severely damaged documents, our
method improves OCR accuracy from 46.83% to 84.05%, with further enhancement
to 94.25% through human-machine collaboration. We believe this work represents
a significant advancement in automated historical document restoration and
contributes substantially to cultural heritage preservation. The model and
dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
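
The three-stage, modular workflow described above (damage localization, context-based text prediction, appearance restoration, with optional human intervention between stages) can be illustrated with a toy sketch. This is not the AutoHDR implementation: the page is a plain string, the `?` damage marker, the lexicon-matching predictor, and every function name here are hypothetical stand-ins chosen only to show how independent stages and review hooks might compose.

```python
from typing import Callable, Dict, List, Optional

DAMAGED = "?"  # toy marker for an unreadable character (assumption, not from the paper)


def localize(page: str) -> List[int]:
    """Stage 1 stand-in: return the positions of damaged characters."""
    return [i for i, ch in enumerate(page) if ch == DAMAGED]


def predict(page: str, positions: List[int], lexicon: List[str]) -> Dict[int, str]:
    """Stage 2 stand-in: guess damaged characters from their word context.

    A real system would use a vision-language model; here a word is matched
    against a small lexicon on its undamaged characters.
    """
    wanted = set(positions)
    guesses: Dict[int, str] = {}
    start = 0  # index of the current word's first character in the page
    for word in page.split(" "):
        for cand in lexicon:
            if len(cand) == len(word) and all(
                w in (DAMAGED, c) for w, c in zip(word, cand)
            ):
                for off, w in enumerate(word):
                    if w == DAMAGED and start + off in wanted:
                        guesses[start + off] = cand[off]
                break  # keep the first matching candidate
        start += len(word) + 1  # skip the word and the following space
    return guesses


def restore(page: str, guesses: Dict[int, str]) -> str:
    """Stage 3 stand-in: write the predicted characters back into the page."""
    chars = list(page)
    for pos, ch in guesses.items():
        chars[pos] = ch
    return "".join(chars)


def run_pipeline(
    page: str,
    lexicon: List[str],
    review_positions: Optional[Callable[[List[int]], List[int]]] = None,
    review_guesses: Optional[Callable[[Dict[int, str]], Dict[int, str]]] = None,
) -> str:
    """Run the three stages, letting a human hook adjust each intermediate result."""
    positions = localize(page)
    if review_positions is not None:
        positions = review_positions(positions)
    guesses = predict(page, positions, lexicon)
    if review_guesses is not None:
        guesses = review_guesses(guesses)
    return restore(page, guesses)


if __name__ == "__main__":
    page = "the qu?ck brown f?x"
    print(run_pipeline(page, ["quick", "fox"]))
    # A reviewer overrides the prediction at position 17 before restoration.
    print(run_pipeline(page, ["quick", "fox"], review_guesses=lambda g: {**g, 17: "i"}))
```

Because each stage only consumes the previous stage's output, a reviewer can correct the localized positions or the predicted characters without rerunning or modifying the other stages, which is the kind of intervention the modular design is meant to allow.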