TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts

Abstract

Multimodal large language models (MLLMs) have recently attracted considerable attention for their impressive capabilities. Evaluating MLLMs is becoming critical for analyzing their attributes and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity: minor prompt variations may lead to significant performance fluctuations. Inappropriate prompts may therefore obscure a model's capabilities and cause its performance to be underestimated. Moreover, different models prefer different prompts, so using the same prompt for all models introduces evaluation bias. This paper analyzes this deficiency in existing benchmarks and introduces TP-Eval, a new evaluation framework that uses a prompt customization method to reduce evaluation bias and tap models' potential. TP-Eval rewrites the original prompts into different customized prompts for different models. In particular, we propose several well-designed modules for prompt customization tailored to the MLLM evaluation scenario. Extensive experiments demonstrate the effectiveness of our approach in uncovering models' capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
