Task Me Anything
June 17, 2024
Authors: Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna
cs.AI
Abstract
Benchmarks for large multimodal language models (MLMs) now serve to
simultaneously assess the general capabilities of models instead of evaluating
for a specific capability. As a result, when a developer wants to identify
which models to use for their application, they are overwhelmed by the number
of benchmarks and remain uncertain about which benchmark's results are most
reflective of their specific use case. This paper introduces Task-Me-Anything,
a benchmark generation engine that produces a benchmark tailored to a user's
needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and
can programmatically generate a vast number of task instances. It also answers
user queries about MLM performance algorithmically and efficiently, staying
within a computational budget. It contains 113K images, 10K videos, 2K 3D
object assets, over 365 object categories, 655 attributes, and 335
relationships. It can generate 750M image/video question-answering pairs, which
focus on evaluating MLM perceptual capabilities. Task-Me-Anything reveals
critical insights: open-source MLMs excel in object and attribute recognition
but lack spatial and temporal understanding; each model exhibits unique
strengths and weaknesses; larger models generally perform better, though
exceptions exist; and GPT4o demonstrates challenges in recognizing
rotating/moving objects and distinguishing colors.
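
To make the idea of programmatic task generation concrete, here is a minimal Python sketch of how question-answer pairs might be enumerated from a small taxonomy. It is an illustration only and not the Task-Me-Anything implementation: the toy taxonomy (`OBJECTS`, `POSITIONS`), the `TaskInstance` class, and the `generate_color_tasks` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from itertools import product
import random


@dataclass(frozen=True)
class TaskInstance:
    """One multiple-choice question with its ground-truth answer."""
    question: str
    answer: str
    options: tuple


# Hypothetical toy taxonomy: object category -> known attributes.
OBJECTS = {
    "mug": {"color": "red"},
    "chair": {"color": "blue"},
    "lamp": {"color": "green"},
}
# Hypothetical placements an object could take in a rendered scene.
POSITIONS = ["left", "right", "top", "bottom"]


def generate_color_tasks(objects, positions, seed=0):
    """Enumerate one color question per (object, position) combination."""
    rng = random.Random(seed)  # fixed seed keeps the option order reproducible
    colors = sorted({attrs["color"] for attrs in objects.values()})
    tasks = []
    for name, position in product(sorted(objects), positions):
        options = colors[:]
        rng.shuffle(options)
        tasks.append(
            TaskInstance(
                question=f"What color is the {name} on the {position} of the image?",
                answer=objects[name]["color"],
                options=tuple(options),
            )
        )
    return tasks


tasks = generate_color_tasks(OBJECTS, POSITIONS)
```

Even in this toy setting the number of instances is the product of the taxonomy dimensions (3 objects × 4 positions), which suggests how a much larger asset taxonomy combined with several task templates could reach a very large pool of question-answer pairs.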