

KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language

March 31, 2025
Authors: Yoonshik Kim, Jaeyoon Jung
cs.AI

Abstract

The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions, each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA.
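The abstract describes grading each free-form response with a judge model that only checks a pre-determined, per-question rubric, rather than assigning a subjective score. Below is a minimal sketch of how such rubric-based grading could be wired up. It is not the released evaluation code: the GradingItem fields, point values, prompt wording, and stub judge are illustrative assumptions; in practice the judge would be an actual (possibly small, open-source) language model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GradingItem:
    question: str                     # Korean question shown to the VLM
    image_path: str                   # path to the paired image
    criteria: list[tuple[str, int]]   # (criterion text, points if satisfied)

def build_judge_prompt(item: GradingItem, response: str) -> str:
    """Assemble a grading prompt in which the judge only decides whether
    each pre-determined criterion is satisfied by the response."""
    rubric = "\n".join(f"- ({pts} pts) {text}" for text, pts in item.criteria)
    return (
        f"Question: {item.question}\n"
        f"Model response: {response}\n"
        f"Grading criteria:\n{rubric}\n"
        "For each criterion, answer YES or NO, one per line."
    )

def grade(item: GradingItem, response: str,
          judge: Callable[[str], str]) -> int:
    """Score a response by summing points for criteria the judge marks YES."""
    verdicts = judge(build_judge_prompt(item, response)).splitlines()
    score = 0
    for (_, pts), verdict in zip(item.criteria, verdicts):
        if verdict.strip().upper().startswith("YES"):
            score += pts
    return score

# Example with a stub judge; swap in a call to a real judge model.
item = GradingItem(
    question="이 이미지에 있는 동물은 무엇인가요?",
    image_path="images/0001.jpg",
    criteria=[("The response states that the animal is a cat.", 10)],
)
stub_judge = lambda prompt: "YES"
print(grade(item, "고양이입니다.", stub_judge))  # -> 10
```

Because the judge only emits per-criterion yes/no verdicts against fixed rules, the final score is determined by the rubric rather than by the judge's own preferences, which is what makes the evaluation reproducible even with a small judge model.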

