Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
July 7, 2025
Authors: Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
cs.AI
Abstract
Historical documents represent an invaluable cultural heritage, yet have
undergone significant degradation over time through tears, water erosion, and
oxidation. Existing Historical Document Restoration (HDR) methods primarily
focus on a single modality or on limited-size restoration, and so fall short of
practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel
automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and
6,543 synthetic images with character-level and line-level locations, as well
as character annotations in different damage grades. AutoHDR mimics historians'
restoration workflows through a three-stage approach: OCR-assisted damage
localization, vision-language context text prediction, and patch autoregressive
appearance restoration. The modular architecture of AutoHDR enables seamless
human-machine collaboration, allowing for flexible intervention and
optimization at each restoration stage. Experiments demonstrate AutoHDR's
remarkable performance in HDR. When processing severely damaged documents, our
method improves OCR accuracy from 46.83% to 84.05%, with further enhancement
to 94.25% through human-machine collaboration. We believe this work represents
a significant advancement in automated historical document restoration and
contributes substantially to cultural heritage preservation. The model and
dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
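
The three-stage, human-in-the-loop structure described in the abstract can be illustrated with a minimal sketch. This is not the AutoHDR code or its API: the `Region` type, the `page` dictionary layout, the stage functions, and the `reviewer` hook are all hypothetical names chosen for illustration, and each stage body is a trivial stub standing in for the real model. The sketch only shows how a modular pipeline lets a person inspect or correct the output after each stage.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Region:
    """A damaged area on a page (hypothetical structure, for illustration)."""
    box: Tuple[int, int, int, int]   # (x, y, width, height)
    text: Optional[str] = None       # predicted missing text, if any
    restored: bool = False           # whether the appearance was restored


# A reviewer receives the stage name and the current regions and returns
# the (possibly corrected) regions. This hook is an assumption of the sketch.
Reviewer = Callable[[str, List[Region]], List[Region]]


def localize_damage(page: Dict) -> List[Region]:
    # Stub for stage 1 (OCR-assisted damage localization):
    # here the damaged boxes are simply read from the input.
    return [Region(box=tuple(b)) for b in page["damaged_boxes"]]


def predict_text(page: Dict, regions: List[Region]) -> List[Region]:
    # Stub for stage 2 (vision-language context text prediction):
    # here predictions are looked up in a table supplied with the page.
    guesses = page.get("guesses", {})
    for region in regions:
        region.text = guesses.get(region.box)
    return regions


def restore_appearance(page: Dict, regions: List[Region]) -> List[Region]:
    # Stub for stage 3 (patch autoregressive appearance restoration):
    # here a region counts as restored once it has text to render.
    for region in regions:
        region.restored = region.text is not None
    return regions


def run_pipeline(page: Dict, reviewer: Optional[Reviewer] = None) -> List[Region]:
    """Run the three stages in order, offering each result to the reviewer."""
    regions = localize_damage(page)
    if reviewer:
        regions = reviewer("localize", regions)
    regions = predict_text(page, regions)
    if reviewer:
        regions = reviewer("predict", regions)
    regions = restore_appearance(page, regions)
    if reviewer:
        regions = reviewer("restore", regions)
    return regions
```

In this sketch, a reviewer that fills in text the stub could not predict at the "predict" stage allows the final stage to restore every region, which loosely mirrors how the abstract describes human intervention improving results.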