VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery

Abstract

Analyzing cultural-heritage artifacts remains challenging for multimodal large language models (MLLMs): general-purpose models lack domain expertise, and supervised fine-tuning (SFT) often overfits to superficial patterns, yielding brittle reasoning for authentication and historical attribution. This raises the question of how to equip MLLMs with robust, expert-level reasoning about ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns evaluation into supervision: we construct a taxonomy of question types, probe the SFT model to localize type-specific performance gaps, and optimize with type-conditioned, compositionality-oriented rewards that target those gaps. We also release VaseVQA, a comprehensive benchmark of 31,773 images designed to probe deep understanding. Experiments show state-of-the-art results on style classification and historical attribution, with marked gains in compositional robustness over SFT-only baselines. These results validate diagnosis-guided, taxonomy-conditioned reward engineering and provide a reusable resource for future research. Code and dataset will be available at https://github.com/AIGeeksGroup/VaseVQA.
