Towards Open-ended Visual Quality Comparison

Abstract

Comparative settings (e.g. pairwise choice, listwise ranking) have been adopted by a wide range of subjective studies for image quality assessment (IQA), because they inherently standardize the evaluation criteria across different observers and offer more clear-cut responses. In this work, we extend the capabilities of emerging large multi-modality models (LMMs) to advance visual quality comparison into open-ended settings that (1) can respond to open-range questions on quality comparison and (2) can provide detailed reasoning beyond direct answers. To this end, we propose Co-Instruct. To train this first-of-its-kind open-source, open-ended visual quality comparer, we collect the Co-Instruct-562K dataset from two sources: (a) LMM-merged single-image quality descriptions and (b) GPT-4V "teacher" responses on unlabeled data. Furthermore, to better evaluate this setting, we propose MICBench, the first benchmark on multi-image comparison for LMMs. We demonstrate that Co-Instruct not only achieves 30% higher accuracy than state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher), on both existing related benchmarks and the proposed MICBench. Our model is published at https://huggingface.co/q-future/co-instruct.
