Rethinking Image Evaluation in Super-Resolution

Abstract

Although recent image super-resolution (SR) techniques continue to improve the perceptual quality of their outputs, they often fail in quantitative evaluations. This inconsistency has led to growing distrust of existing image metrics for SR evaluation. Image evaluation depends on both the metric and the reference ground truth (GT), yet researchers typically do not examine the role of GTs, because they are generally accepted as "perfect" references. However, because the data were collected years ago and other types of distortion were not controlled for, we point out that GTs in existing SR datasets can be of relatively poor quality, which leads to biased evaluations. Following this observation, this paper asks three questions: Are GT images in existing SR datasets 100% trustworthy for model evaluation? How does GT quality affect this evaluation? And how can evaluations be made fair when GTs are imperfect? To answer these questions, this paper presents two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that low-quality GTs affect SR performance consistently across models, and that models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, the Relative Quality Index (RQI), which measures the relative quality discrepancy of image pairs and thereby corrects the biased evaluations caused by unreliable GTs. Our proposed model achieves significantly better consistency with human opinions. We expect our work to provide insights for the SR community on how future datasets, models, and metrics should be developed.
