Rethinking Image Evaluation in Super-Resolution

Abstract

Although recent image super-resolution (SR) techniques continue to improve the perceptual quality of their outputs, they often fare poorly in quantitative evaluations. This inconsistency has led to growing distrust of existing image metrics for SR evaluation. Although image evaluation depends on both the metric and the reference ground truth (GT), researchers typically do not examine the role of GTs, because they are generally accepted as "perfect" references. However, because the data were collected years ago and other types of distortion were not controlled for, we point out that GTs in existing SR datasets can be of relatively poor quality, which leads to biased evaluations. Following this observation, this paper asks three questions: Are GT images in existing SR datasets 100% trustworthy for model evaluation? How does GT quality affect that evaluation? And how can evaluations be made fair when some GTs are imperfect? To answer these questions, this paper makes two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that low-quality GTs consistently affect measured SR performance across models, and that models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, the Relative Quality Index (RQI), which measures the relative quality discrepancy between the images in a pair and thereby addresses the biased evaluations caused by unreliable GTs. Our proposed model achieves significantly better consistency with human opinions. We expect this work to offer the SR community insights into how future datasets, models, and metrics should be developed.
