Text-Aware Image Restoration with Diffusion Models

June 11, 2025
Authors: Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
cs.AI

Abstract

Image restoration aims to recover degraded images. However, existing diffusion-based restoration methods, despite their great success in natural image restoration, often struggle to faithfully reconstruct textual regions in degraded images. These methods frequently generate plausible but incorrect text-like patterns, a phenomenon we refer to as text-image hallucination. In this paper, we introduce Text-Aware Image Restoration (TAIR), a novel restoration task that requires the simultaneous recovery of visual content and textual fidelity. To tackle this task, we present SA-Text, a large-scale benchmark of 100K high-quality scene images densely annotated with diverse and complex text instances. Furthermore, we propose a multi-task diffusion framework, called TeReDiff, that integrates internal features from diffusion models into a text-spotting module, enabling both components to benefit from joint training. This allows for the extraction of rich text representations, which are utilized as prompts in subsequent denoising steps. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art restoration methods, achieving significant gains in text recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/
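To make the feedback structure described above concrete, here is a minimal, self-contained PyTorch sketch of the loop the abstract outlines: a denoising step exposes internal features, a text-spotting head turns those features into a text embedding, and that embedding conditions the next denoising step. All module names, shapes, the conditioning scheme, and the toy update rule are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DenoisingBackbone(nn.Module):
    """Stand-in for the diffusion backbone: returns a noise estimate plus
    the internal features fed to the text spotter (hypothetical interface)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3, channels, 3, padding=1)
        self.decoder = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x, prompt_emb):
        feats = torch.relu(self.encoder(x))
        # Assumed conditioning: broadcast the prompt embedding over space.
        feats = feats + prompt_emb[:, :, None, None]
        return self.decoder(feats), feats

class TextSpottingHead(nn.Module):
    """Stand-in text-spotting module: pools diffusion features into a
    text-representation vector reused as the next step's prompt."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.to_prompt = nn.Linear(channels, channels)

    def forward(self, feats):
        return self.to_prompt(self.pool(feats).flatten(1))

backbone, spotter = DenoisingBackbone(), TextSpottingHead()
x = torch.randn(1, 3, 64, 64)      # degraded input image (toy size)
prompt = torch.zeros(1, 64)        # no text prompt at the first step
for t in range(4):                 # a few illustrative denoising steps
    noise_pred, feats = backbone(x, prompt)
    x = x - 0.1 * noise_pred       # toy update, not a real sampler schedule
    prompt = spotter(feats)        # spotted text features -> next prompt
```

In the actual framework, the backbone and the text-spotting module would be trained jointly (a restoration objective alongside text-spotting losses), which is what lets the internal features carry text-aware information; the sketch only illustrates the inference-time feedback loop.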