VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

August 12, 2023
Authors: Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schmidt
cs.AI

Abstract

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors, e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps and potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's responses on the project website. Data, code, and the leaderboard are available at visit-bench.github.io.
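The text-only evaluation protocol described above is straightforward to reproduce in spirit: because each instance carries a human-written instruction-conditioned caption, that caption can stand in for the image, letting a language-only judge compare a candidate response against the human-verified reference. The following is a minimal sketch of that idea in Python; the prompt wording, the `llm` callable, and all field names are illustrative assumptions, not the benchmark's actual code or prompt.

```python
# Sketch of VisIT-Bench-style pairwise evaluation with a text-only LLM judge.
# The instruction-conditioned caption substitutes for the image, so the judge
# never needs to see pixels. All names here (llm, judge_pairwise, field names)
# are illustrative assumptions, not the benchmark's actual API.

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class VisITInstance:
    instruction: str          # e.g. "Is this storefront accessible to wheelchair users?"
    conditioned_caption: str  # human-written caption surfacing instruction-specific details
    reference: str            # human-verified reference output

JUDGE_PROMPT = """You are shown a description of an image and an instruction.
Image description: {caption}
Instruction: {instruction}

Response A: {a}
Response B: {b}

Which response follows the instruction better? Answer 'A' or 'B'."""

def judge_pairwise(llm: Callable[[str], str],
                   inst: VisITInstance,
                   candidate: str) -> bool:
    """Return True if the text-only judge prefers the candidate over the reference."""
    prompt = JUDGE_PROMPT.format(
        caption=inst.conditioned_caption,
        instruction=inst.instruction,
        a=candidate,
        b=inst.reference,
    )
    verdict = llm(prompt).strip().upper()
    return verdict.startswith("A")

def win_rate(llm: Callable[[str], str],
             instances: Sequence[VisITInstance],
             candidates: Sequence[str]) -> float:
    """Fraction of comparisons in which the candidate model beats the reference."""
    wins = sum(judge_pairwise(llm, i, c) for i, c in zip(instances, candidates))
    return wins / len(instances)
```

In practice one would also randomize which response appears as A or B to control for position bias in the judge, and report win rates aggregated per instruction family as well as overall.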