

TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

October 23, 2024
Authors: Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
cs.AI

Abstract

Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical to analyzing attributes of MLLMs and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure the models' capabilities, underestimating the models' performance. Moreover, different models have different preferences for different prompts, so using the same prompt for all models will cause evaluation bias. This paper analyzes this deficiency in existing benchmarks and further introduces a new evaluation framework named TP-Eval, which introduces a prompt customization method to reduce evaluation biases and tap models' potential. TP-Eval rewrites the original prompts into different customized prompts for different models. In particular, we propose some well-designed modules for prompt customization tailored to the scenario of MLLM evaluation. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
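
The abstract describes per-model prompt customization only at a high level. The sketch below illustrates one plausible shape such a customization loop could take: for each model, candidate rewrites of the original benchmark prompt are scored on a small development split and the best-performing variant is kept. This is not the paper's implementation; the helpers `rewrite_prompt` and `evaluate`, the dev split, and the number of rounds are hypothetical assumptions introduced here for illustration.

```python
# Minimal sketch of per-model prompt customization for MLLM evaluation.
# All names (rewrite_prompt, evaluate, dev_split) are hypothetical
# placeholders, not the paper's actual modules.

from typing import Callable, List, Tuple


def evaluate(model, prompt: str, split) -> float:
    """Exact-match accuracy of one model under one prompt."""
    correct = sum(model(prompt, image).strip() == truth for image, truth in split)
    return correct / max(len(split), 1)


def customize_prompt(
    model: Callable[[str, bytes], str],                 # model(prompt, image) -> answer
    original_prompt: str,
    dev_split: List[Tuple[bytes, str]],                 # (image, ground truth) pairs
    rewrite_prompt: Callable[[str, float], List[str]],  # proposes prompt variants
    n_rounds: int = 3,
) -> str:
    """Iteratively rewrite the task prompt and keep the variant on which
    this particular model scores highest on the small dev split."""
    best_prompt = original_prompt
    best_acc = evaluate(model, best_prompt, dev_split)
    for _ in range(n_rounds):
        for candidate in rewrite_prompt(best_prompt, best_acc):
            acc = evaluate(model, candidate, dev_split)
            if acc > best_acc:
                best_prompt, best_acc = candidate, acc
    return best_prompt
```

Under this scheme, each model is then evaluated on the full benchmark with its own customized prompt rather than a single shared prompt, which is the source of evaluation bias the paper aims to remove.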

