Rethinking Image Evaluation in Super-Resolution

Abstract

While recent image super-resolution (SR) techniques continually improve the perceptual quality of their outputs, they often fail in quantitative evaluations. This inconsistency has led to growing distrust of existing image metrics for SR evaluation. Although image evaluation depends on both the metric and the reference ground truth (GT), researchers typically do not examine the role of GTs, which are generally accepted as "perfect" references. However, because the data were collected years ago and other types of distortion were not controlled for, we point out that GTs in existing SR datasets can exhibit relatively poor quality, which leads to biased evaluations. Following this observation, we ask three questions: Are GT images in existing SR datasets 100% trustworthy for model evaluation? How does GT quality affect this evaluation? And how can evaluations be made fair when GTs are imperfect? To answer these questions, this paper presents two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that low-quality GTs consistently affect SR performance across models, and that models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, the Relative Quality Index (RQI), which measures the relative quality discrepancy of image pairs and thereby addresses the biased evaluations caused by unreliable GTs. Our proposed model achieves significantly better consistency with human opinions. We expect our work to provide insights for the SR community on how future datasets, models, and metrics should be developed.
