VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

Abstract

Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). Further analysis is essential, but datasets for evaluating the visual perception of LVLMs remain scarce. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) The 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, perform poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but the observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
