VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

Abstract

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models for real-world use. Our starting point is a curated set of 70 "instruction families" that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, the tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors; for example, for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps and potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, in a way that aligns with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; for example, the top-performing instruction-following model wins against the GPT-4 reference in just 27% of comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's responses on the project website. Data, code, and the leaderboard are available at visit-bench.github.io.
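The win rate quoted in the abstract is a fraction of pairwise comparisons in which a model's output is preferred over the reference. The following is a minimal sketch of that arithmetic, not the benchmark's own evaluation code: the function name `win_rate`, the label strings, and the toy data are all hypothetical, and the sketch assumes that any comparison not won by the model (including a tie, if ties are recorded) counts against it.

```python
from collections import Counter


def win_rate(judgments, candidate="model"):
    """Return the fraction of pairwise comparisons won by `candidate`.

    `judgments` is a list of winner labels, one per comparison.
    Any label other than `candidate` counts as a non-win (an assumption
    of this sketch, not something stated in the abstract).
    """
    if not judgments:
        raise ValueError("no comparisons to score")
    counts = Counter(judgments)
    return counts[candidate] / len(judgments)


# Hypothetical toy data for illustration only; not benchmark results.
toy_judgments = ["reference", "model", "reference", "reference"]
toy_rate = win_rate(toy_judgments)
print(f"model win rate: {toy_rate:.0%}")
```

Under these assumptions, the judgments could come from either human annotators or a text-only LLM judge; the aggregation step is the same in both cases.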
