Text-Aware Image Restoration with Diffusion Models
June 11, 2025
Authors: Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, Seungryong Kim
cs.AI
Abstract
Image restoration aims to recover degraded images. However, existing
diffusion-based restoration methods, despite great success in natural image
restoration, often struggle to faithfully reconstruct textual regions in
degraded images. These methods frequently generate plausible but incorrect
text-like patterns, a phenomenon we refer to as text-image hallucination. In
this paper, we introduce Text-Aware Image Restoration (TAIR), a novel
restoration task that requires the simultaneous recovery of visual contents and
textual fidelity. To tackle this task, we present SA-Text, a large-scale
benchmark of 100K high-quality scene images densely annotated with diverse and
complex text instances. Furthermore, we propose a multi-task diffusion
framework, called TeReDiff, that integrates internal features from diffusion
models into a text-spotting module, enabling both components to benefit from
joint training. This allows for the extraction of rich text representations,
which are utilized as prompts in subsequent denoising steps. Extensive
experiments demonstrate that our approach consistently outperforms
state-of-the-art restoration methods, achieving significant gains in text
recognition accuracy. See our project page: https://cvlab-kaist.github.io/TAIR/
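To make the joint-training idea concrete, below is a minimal, illustrative sketch of a multi-task setup in the spirit described above: a toy denoiser exposes its internal features to a toy text-spotting head, both are trained jointly, and the spotted-text embedding conditions the next denoising step. All module names, shapes, and losses here are assumptions for illustration only, not the authors' actual TeReDiff implementation.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Toy denoiser: predicts noise from a noisy image plus a text-conditioning
    vector, and exposes an intermediate feature map for the spotting head."""
    def __init__(self, cond_dim=32):
        super().__init__()
        self.enc = nn.Conv2d(3, 16, 3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, 16)
        self.dec = nn.Conv2d(16, 3, 3, padding=1)

    def forward(self, x_noisy, cond):
        feat = torch.relu(self.enc(x_noisy))
        feat = feat + self.cond_proj(cond)[:, :, None, None]  # inject text prompt
        return self.dec(feat), feat  # (predicted noise, internal features)

class TinySpotter(nn.Module):
    """Toy spotting head: maps diffusion features to character logits and
    to a conditioning vector that is reused as a prompt."""
    def __init__(self, num_chars=27, cond_dim=32):
        super().__init__()
        self.head = nn.Linear(16, num_chars)
        self.to_cond = nn.Linear(num_chars, cond_dim)

    def forward(self, feat):
        pooled = feat.mean(dim=(2, 3))           # global pool over the feature map
        logits = self.head(pooled)               # text-recognition logits
        cond = self.to_cond(logits.softmax(-1))  # embedding reused as a prompt
        return logits, cond

denoiser, spotter = TinyDenoiser(), TinySpotter()
opt = torch.optim.Adam(
    list(denoiser.parameters()) + list(spotter.parameters()), lr=1e-4)

x_noisy = torch.randn(2, 3, 64, 64)        # degraded/noisy input (dummy data)
noise_target = torch.randn(2, 3, 64, 64)   # diffusion noise target (dummy data)
char_target = torch.randint(0, 27, (2,))   # ground-truth character class (dummy data)

cond = torch.zeros(2, 32)  # no text prompt at the first step
for step in range(2):      # two denoising steps; the input is kept fixed for brevity
    pred_noise, feat = denoiser(x_noisy, cond)
    logits, cond = spotter(feat)           # spotted text conditions the NEXT step
    loss = (nn.functional.mse_loss(pred_noise, noise_target)
            + nn.functional.cross_entropy(logits, char_target))
    opt.zero_grad(); loss.backward(); opt.step()
    cond = cond.detach()  # stop gradients from flowing into the previous step
```

The design choice illustrated here is the feedback loop: the spotting head is trained on the denoiser's internal features (so both benefit from the joint loss), and its output text representation is fed back as conditioning for subsequent denoising steps, which is the mechanism the abstract attributes to TeReDiff.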