VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

Abstract

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluating instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction-tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors; e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps and potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of comparisons. VisIT-Bench is dynamic and simple to participate in: practitioners submit their model's responses on the project website. Data, code, and the leaderboard are available at visit-bench.github.io.
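The evaluation idea in the abstract (the instruction-conditioned caption stands in for the image so that a text-only LLM can compare a candidate against the reference) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Instance` field names, the prompt wording, and the "A"/"B" verdict format are hypothetical and are not the benchmark's actual schema or judge prompt.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One test query. Field names are illustrative, not the dataset schema."""
    instruction: str
    caption: str    # human-authored instruction-conditioned caption
    reference: str  # human-verified reference output


def build_judge_prompt(inst: Instance, candidate: str) -> str:
    """Build a text-only comparison prompt; the caption replaces the image."""
    return (
        f"Image description: {inst.caption}\n"
        f"Instruction: {inst.instruction}\n"
        f"Response A: {candidate}\n"
        f"Response B: {inst.reference}\n"
        "Which response follows the instruction better? Answer A or B."
    )


def win_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons in which the candidate (Response A) was preferred."""
    if not verdicts:
        raise ValueError("win_rate needs at least one verdict")
    return sum(v == "A" for v in verdicts) / len(verdicts)


# Made-up example loosely following the storefront scenario in the abstract.
example = Instance(
    instruction="Is this storefront accessible to wheelchair users?",
    caption="A storefront with a single step at the entrance and no visible ramp.",
    reference="Probably not: the entrance has a step and no ramp is visible.",
)
prompt = build_judge_prompt(example, "Yes, it looks fully accessible.")
```

A figure such as the 27% in the abstract corresponds to `win_rate` computed over all pairwise verdicts for one model. A practical judge would likely also randomize which response appears as A to reduce position bias; the sketch leaves that out.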
